Abstract



The lasso is a popular tool for sparse linear regression, especially for problems in which the 
number of variables p exceeds the number of observations n. But when p > n, the lasso criterion 
is not strictly convex, and hence it may not have a unique minimum. An important question is: 
when is the lasso solution well-defined (unique)? We review results from the literature, which 
show that if the predictor variables are drawn from a continuous probability distribution, then 
there is a unique lasso solution with probability one, regardless of the sizes of n and p. We also 
show that this result extends easily to $\ell_1$ penalized minimization problems over a wide range of
loss functions. 

A second important question is: how can we manage the case of non-uniqueness in lasso 
solutions? In light of the aforementioned result, this case really only arises when some of the 
predictor variables are discrete, or when some post-processing has been performed on continuous 
predictor measurements. Though we certainly cannot claim to provide a complete answer to 
such a broad question, we do present progress towards understanding some aspects of non-uniqueness.
First, we extend the LARS algorithm for computing the lasso solution path to
cover the non-unique case, so that this path algorithm works for any predictor matrix. Next,
we derive a simple method for computing the component-wise uncertainty in lasso solutions of
any given problem instance, based on linear programming. Finally, we review results from the 
literature on some of the unifying properties of lasso solutions, and also point out particular 
forms of solutions that have distinctive properties. 

1 Introduction 

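We consider $\ell_1$ penalized linear regression, also known as the lasso problem (Tibshirani 1996, Chen
et al. 1998). Given a response vector $y \in \mathbb{R}^n$, a matrix $X \in \mathbb{R}^{n \times p}$ of predictor variables, and a
tuning parameter $\lambda \geq 0$, the lasso estimate can be defined as

$$\hat\beta \in \mathrm{argmin}_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1. \qquad (1)$$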

The lasso solution is unique when $\mathrm{rank}(X) = p$, because the criterion is strictly convex. This is not
true when $\mathrm{rank}(X) < p$, and in this case, there can be multiple minimizers of the lasso criterion
(emphasized by the element notation in (1)). Note that when the number of variables exceeds the
number of observations, $p > n$, we must have $\mathrm{rank}(X) < p$.

The lasso is quite a popular tool for estimating the coefficients in a linear model, especially in 
the high-dimensional setting, $p > n$. Depending on the value of the tuning parameter $\lambda$, solutions of
the lasso problem will have many coefficients set exactly to zero, due to the nature of the $\ell_1$ penalty.
We tend to think of the support set of a lasso solution $\hat\beta$, written $\mathcal{A} = \mathrm{supp}(\hat\beta) \subseteq \{1, \ldots, p\}$ and
often referred to as the active set, as describing a particular subset of important variables for the
linear model of $y$ on $X$. Recently, there has been a lot of interesting work legitimizing this claim by
proving desirable properties of $\hat\beta$ or its active set $\mathcal{A}$, in terms of estimation error or model recovery.
Most of this work falls into the setting $p > n$. But such properties are not the focus of the current
paper. Instead, our focus is somewhat simpler and more basic: we investigate
issues concerning the uniqueness or non-uniqueness of lasso solutions.

Let us first take a step back, and consider the usual linear regression estimate (given by $\lambda = 0$
in (1)), as a motivating example. Students of statistics are taught to distrust the coefficients given
by linear regression when $p > n$. We may ask: why? Arguably, the main reason is that the linear
regression solution is not unique when $p > n$ (or more precisely, when $\mathrm{rank}(X) < p$), and further,
this non-uniqueness occurs in such a way that we can always find a variable $i \in \{1, \ldots, p\}$ whose
coefficient is positive at one solution and negative at another. (Adding any element of the null space 
of X to one least squares solution produces another solution.) This makes it generally impossible 
to interpret the linear regression estimate when p > n. 
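The null-space argument above can be checked numerically. The following is a minimal sketch on made-up Gaussian data (the sizes, seed, and variable names are our own choices, not taken from the paper): it builds one least squares solution, then shifts it along a null-space direction so that a chosen coefficient flips sign while the fit stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 5                      # more variables than observations
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# One least squares solution: the minimum-norm solution X^+ y.
beta = np.linalg.pinv(X) @ y

# A basis for null(X): right singular vectors beyond the rank of X.
_, sing, Vt = np.linalg.svd(X)
rank = int(np.sum(sing > 1e-10))
null_basis = Vt[rank:].T         # shape (p, p - rank)

# Pick a null-space direction v with a large entry at coordinate i, and shift
# beta along v so that coefficient i changes sign; X @ beta is unchanged.
i = 0
v = null_basis[:, int(np.argmax(np.abs(null_basis[i])))]
beta_flipped = beta - 2.0 * beta[i] / v[i] * v

print("coefficient", i, "at the two solutions:", beta[i], beta_flipped[i])
print("same fit:", np.allclose(X @ beta, X @ beta_flipped))
```

Both vectors reproduce the same fitted values, so the data cannot distinguish between a positive and a negative coefficient for variable `i`.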

Meanwhile, the lasso estimate is also not unique when p > n (or when rank(X) < p), but it is 
commonly used in this case, and in practice little attention is paid to uniqueness. Upon reflection, 
this seems somewhat surprising, because non-uniqueness of solutions can cause major problems in 
terms of interpretation (as demonstrated by the linear regression case). Two basic questions are:

• Do lasso estimates suffer from the same sign inconsistencies as do linear regression estimates?
That is, for a fixed $\lambda$, can one lasso solution have a positive $i$th coefficient, and another have
a negative $i$th coefficient?

• Must any two lasso solutions, at the same value of $\lambda$, necessarily share the same support, and
differ only in their estimates of the nonzero coefficient values? Or can different lasso solutions
exhibit different active sets?

Consider the following example, concerning the second question. Here we let $n = 5$ and $p = 10$. For
a particular response $y \in \mathbb{R}^5$ and predictor matrix $X \in \mathbb{R}^{5 \times 10}$, and $\lambda = 1$, we found two solutions
of the lasso problem (1), using two different algorithms. These are

$\hat\beta^{(1)} = (-0.893, 0.620, 0.375, 0.497, \ldots, 0)^T$ and
$\hat\beta^{(2)} = (-0.893, 0.869, 0.624, 0, \ldots, 0)^T$,

where we use ellipses to denote all zeros. In other words, the first solution has support set {1, 2, 3, 4}, 
and the second has support set {1,2,3}. This is not at all ideal for the purposes of interpretation, 
because depending on which algorithm we used to minimize the lasso criterion, we may have considered
the 4th variable to be important or not. Moreover, who knows which variables may have zero
coefficients at other solutions? 

In Section 2, we show that if the entries of the predictor matrix $X$ are drawn from a continuous
probability distribution, then we essentially never have to worry about the latter problem (along
with the problem of sign inconsistencies, and any other issues relating to non-uniqueness) because
the lasso solution is unique with probability one. We emphasize that here uniqueness is ensured
with probability one (over the distribution of $X$) regardless of the sizes of $n$ and $p$. This result has
appeared in various forms in the literature, but is perhaps not as well-known as it should
be. Section 2 gives a detailed review of why this fact is true.

Therefore, the two questions raised above only need to be addressed in the case that X contains 
discrete predictors, or contains some kind of post-processed versions of continuously drawn predictor 
measurements. To put it bluntly (and save any dramatic tension), the answer to the first question
is "no". In other words, no two lasso solutions can attach opposite signed coefficients to the same
variable. We show this using a very simple argument in Section 2. As for the second question, the
example above already shows that the answer is unfortunately "yes". However, the multiplicity of
active sets can be dealt with in a principled manner, as we argue in Section 4. Here we show how
to compute lower and upper bounds on the coefficients of lasso solutions of any particular problem
instance; this reveals exactly which variables are assigned zero coefficients at some lasso solutions,
and which variables have nonzero coefficients at all lasso solutions.
Apart from addressing these two questions, we also attempt to better understand the non-unique
case through other means. In Section 3, we extend the well-known LARS algorithm for computing
the lasso solution path (over the tuning parameter $\lambda$) to cover the non-unique case. Therefore the
(newly proposed) LARS algorithm can compute a lasso solution path for any predictor matrix $X$.
(The existing LARS algorithm cannot, because it assumes that for any $\lambda$ the active variables form
a linearly independent set, which is not true in general.) The special lasso solution computed by
the LARS algorithm, also called the LARS lasso solution, possesses several interesting properties
in the non-unique case. We explore these mainly in Section 3, and to a lesser extent in Section 5.
Section 5 contains a few final miscellaneous properties relating to non-uniqueness, and the work of
the previous three sections.

In this paper, we both review existing results from the literature, and establish new ones, on the 
topic of uniqueness of lasso solutions. We do our best to acknowledge existing works in the literature, 
with citations either immediately preceding or succeeding the statements of lemmas. The contents
of this paper were already discussed above, but this was presented out of order, and hence we give
a proper outline here. We begin in Section 2 by examining the KKT optimality conditions for
the lasso problem, and we use these to derive sufficient conditions for the uniqueness of the lasso
solution. This culminates in a result that says that if the entries of $X$ are continuously distributed,
then the lasso solution is unique with probability one. We also show that this same result holds
for $\ell_1$ penalized minimization problems over a broad class of loss functions. Essentially, the rest
of the paper focuses on the case of a non-unique lasso solution. Section 3 presents an extension of
the LARS algorithm for the lasso solution path that works for any predictor matrix $X$ (the original
LARS algorithm really only applies to the case of a unique solution). We then discuss some special
properties of the LARS lasso solution. Section 4 develops a method for computing component-wise
lower and upper bounds on lasso coefficients for any given problem instance. In Section 5, we finish
with some related properties, concerning the different active sets of lasso solutions, and a necessary
condition for uniqueness. Section 6 contains some discussion.

Finally, our notation in the paper is as follows. For a matrix $A$, we write $\mathrm{col}(A)$, $\mathrm{row}(A)$, and
$\mathrm{null}(A)$ to denote its column space, row space, and null space, respectively. We use $\mathrm{rank}(A)$ for the
rank of $A$. We use $A^+$ to denote the Moore-Penrose pseudoinverse of $A$, and when $A$ is rectangular,
this means $A^+ = (A^T A)^+ A^T$. For a linear subspace $L$, we write $P_L$ for the projection map onto $L$.
Suppose that $A \in \mathbb{R}^{n \times p}$ has columns $A_1, \ldots, A_p \in \mathbb{R}^n$, written $A = [A_1, \ldots, A_p]$. Then for an index
set $S = \{i_1, \ldots, i_k\} \subseteq \{1, \ldots, p\}$, we let $A_S = [A_{i_1}, \ldots, A_{i_k}]$; in other words, $A_S$ extracts the columns
of $A$ in $S$. Similarly, for a vector $b \in \mathbb{R}^p$, we let $b_S = (b_{i_1}, \ldots, b_{i_k})^T$, or in other words, $b_S$ extracts
the components of $b$ in $S$. We write $A_{-S}$ or $b_{-S}$ to extract the columns or components not in $S$.
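For readers who prefer to see this notation in code, here is a small NumPy sketch on an arbitrary rank-deficient matrix (the matrix, the index set, and the explicit `rcond` cutoff are our own illustrative choices). It checks the pseudoinverse identity numerically and shows the column and component extraction; note that indices are 0-based in the code.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
A[:, 3] = A[:, 0] + A[:, 1]      # make A rank deficient on purpose

# A^+ and the equivalent form (A^T A)^+ A^T. An explicit cutoff is passed so
# that the numerically zero singular value is treated as exactly zero.
A_pinv = np.linalg.pinv(A, rcond=1e-10)
A_pinv_alt = np.linalg.pinv(A.T @ A, rcond=1e-10) @ A.T

S = [0, 2]                                        # index set S
not_S = [j for j in range(A.shape[1]) if j not in S]
A_S, A_notS = A[:, S], A[:, not_S]                # A_S and A_{-S}

b = rng.standard_normal(4)
b_S, b_notS = b[S], b[not_S]                      # b_S and b_{-S}

# The projection onto col(A) can be formed as A A^+.
P_col = A @ A_pinv
print(np.allclose(A_pinv, A_pinv_alt), np.allclose(P_col @ A, A))
```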

2 When is the lasso solution unique? 

In this section, we review the question: when is the lasso solution unique? In truth, we only give a 
partial answer, because we provide sufficient conditions for a unique minimizer of the lasso criterion. 
Later, in Section 5, we study the other direction (a necessary condition for uniqueness).

2.1 Basic facts and the KKT conditions 

We begin by recalling a few basic facts about lasso solutions. 

Lemma 1. For any $y$, $X$, and $\lambda \geq 0$, the lasso problem (1) has the following properties:

(i) There is either a unique lasso solution or an (uncountably) infinite number of solutions.

(ii) Every lasso solution $\hat\beta$ gives the same fitted value $X\hat\beta$.

(iii) If $\lambda > 0$, then every lasso solution $\hat\beta$ has the same $\ell_1$ norm, $\|\hat\beta\|_1$.
Proof. (i) The lasso criterion is convex and has no directions of recession (strictly speaking, when
$\lambda = 0$ the criterion can have directions of recession, but these are directions in which the criterion is
constant). Therefore it attains its minimum over $\mathbb{R}^p$ (see, for example, Theorem 27.1 of Rockafellar
(1970)), that is, the lasso problem has at least one solution. Suppose now that there are two solutions
$\hat\beta^{(1)}$ and $\hat\beta^{(2)}$, $\hat\beta^{(1)} \neq \hat\beta^{(2)}$. Because the solution set of a convex minimization problem is convex, we
know that $\alpha \hat\beta^{(1)} + (1 - \alpha) \hat\beta^{(2)}$ is also a solution for any $0 < \alpha < 1$, which gives uncountably many
lasso solutions as $\alpha$ varies over $(0, 1)$.

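(ii) Suppose that we have two solutions $\hat\beta^{(1)}$ and $\hat\beta^{(2)}$ with $X\hat\beta^{(1)} \neq X\hat\beta^{(2)}$. Let $c^*$ denote the
minimum value of the lasso criterion obtained by $\hat\beta^{(1)}, \hat\beta^{(2)}$. For any $0 < \alpha < 1$, we have

$$\frac{1}{2} \|y - X(\alpha \hat\beta^{(1)} + (1 - \alpha) \hat\beta^{(2)})\|_2^2 + \lambda \|\alpha \hat\beta^{(1)} + (1 - \alpha) \hat\beta^{(2)}\|_1 < \alpha c^* + (1 - \alpha) c^* = c^*,$$

where the strict inequality is due to the strict convexity of the function $f(x) = \|y - x\|_2^2$ along with
the convexity of $f(x) = \|x\|_1$. This means that $\alpha \hat\beta^{(1)} + (1 - \alpha) \hat\beta^{(2)}$ attains a lower criterion value
than $c^*$, a contradiction.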

(iii) By (ii), any two solutions must have the same fitted value, and hence the same squared error 
loss. But the solutions also attain the same value of the lasso criterion, and if $\lambda > 0$, then they must
have the same $\ell_1$ norm. □
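The three properties in Lemma 1 can be seen on a toy problem with two identical predictor columns. The sketch below uses made-up numbers; the helper `lasso_criterion` and the closed-form value `theta` are our own illustrative devices, relying on the observation that with identical unit-norm columns only the sum of the two coefficients affects the fit, so the problem reduces to a one-dimensional soft-thresholding problem.

```python
import numpy as np

def lasso_criterion(X, y, beta, lam):
    """Lasso criterion: 0.5 * ||y - X beta||_2^2 + lam * ||beta||_1."""
    resid = y - X @ beta
    return 0.5 * resid @ resid + lam * np.sum(np.abs(beta))

# Hypothetical toy data: two identical unit-norm predictor columns.
x = np.array([3.0, 4.0]) / 5.0
X = np.column_stack([x, x])
y = np.array([2.0, 1.0])
lam = 0.5

# Optimal value of theta = beta_1 + beta_2: soft-thresholding of x^T y.
xty = x @ y
theta = np.sign(xty) * max(abs(xty) - lam, 0.0)

# Every same-sign split of theta between the two coefficients is a solution.
solutions = [np.array([t * theta, (1.0 - t) * theta]) for t in (0.0, 0.3, 1.0)]
values = [lasso_criterion(X, y, b, lam) for b in solutions]
fits = [X @ b for b in solutions]
l1_norms = [np.sum(np.abs(b)) for b in solutions]

print("criterion values:", values)
print("l1 norms:", l1_norms)
```

The three solutions have different supports, yet they share the same criterion value, the same fit, and the same $\ell_1$ norm, as the lemma predicts.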

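To go beyond the basics, we turn to the Karush-Kuhn-Tucker (KKT) optimality conditions for
the lasso problem (1). These conditions can be written as

$$X^T (y - X\hat\beta) = \lambda \gamma, \qquad (2)$$

$$\gamma_i \in \begin{cases} \{\mathrm{sign}(\hat\beta_i)\} & \text{if } \hat\beta_i \neq 0 \\ [-1, 1] & \text{if } \hat\beta_i = 0 \end{cases}, \quad \text{for } i = 1, \ldots, p. \qquad (3)$$

Here $\gamma \in \mathbb{R}^p$ is called a subgradient of the function $f(x) = \|x\|_1$ evaluated at $x = \hat\beta$. Therefore $\hat\beta$ is
a solution in (1) if and only if $\hat\beta$ satisfies (2) and (3) for some $\gamma$.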
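Conditions (2) and (3) translate directly into a numerical check. The function below is a minimal sketch (the name `satisfies_kkt`, the tolerance, and the toy data are our own assumptions); it assumes $\lambda > 0$, solves (2) for $\gamma$, and then tests (3) coordinate by coordinate.

```python
import numpy as np

def satisfies_kkt(X, y, beta, lam, tol=1e-8):
    """Check (2)-(3): gamma = X^T (y - X beta) / lam must be a subgradient
    of the l1 norm at beta. Assumes lam > 0."""
    beta = np.asarray(beta, dtype=float)
    gamma = X.T @ (y - X @ beta) / lam
    active = np.abs(beta) > tol
    ok_active = np.all(np.abs(gamma[active] - np.sign(beta[active])) <= tol)
    ok_inactive = np.all(np.abs(gamma[~active]) <= 1.0 + tol)
    return bool(ok_active and ok_inactive)

# Hypothetical toy data with two identical unit-norm columns.
x = np.array([3.0, 4.0]) / 5.0
X = np.column_stack([x, x])
y = np.array([2.0, 1.0])
lam = 0.5

print(satisfies_kkt(X, y, [1.5, 0.0], lam))    # a solution
print(satisfies_kkt(X, y, [0.75, 0.75], lam))  # another solution, same fit
print(satisfies_kkt(X, y, [2.0, -0.5], lam))   # same fit, but a wrong sign
```

The last candidate has the same fitted values as the first two but fails the check, because its second coefficient disagrees in sign with the (unique) subgradient.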

We now use the KKT conditions to write the lasso fit and solutions in a more explicit form. In
what follows, we assume that $\lambda > 0$ for the sake of simplicity (dealing with the case $\lambda = 0$ is not
difficult, but some of the definitions and statements need to be modified, which we avoid here in order to
preserve readability). First we define the equicorrelation set $\mathcal{E}$ by

$$\mathcal{E} = \{ i \in \{1, \ldots, p\} : |X_i^T (y - X\hat\beta)| = \lambda \}. \qquad (4)$$

The equicorrelation set $\mathcal{E}$ is named as such because when $y$, $X$ have been standardized, $\mathcal{E}$ contains
the variables that have equal (and maximal) absolute correlation with the residual. We define the
equicorrelation signs $s$ by

$$s = \mathrm{sign}(X_\mathcal{E}^T (y - X\hat\beta)). \qquad (5)$$

Recalling (2), we note that the optimal subgradient $\gamma$ is unique (by the uniqueness of the fit $X\hat\beta$),
and we can equivalently define $\mathcal{E}$, $s$ in terms of $\gamma$, as in $\mathcal{E} = \{i \in \{1, \ldots, p\} : |\gamma_i| = 1\}$ and $s = \gamma_\mathcal{E}$.
The uniqueness of $X\hat\beta$ (or the uniqueness of $\gamma$) implies the uniqueness of $\mathcal{E}$, $s$.

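By definition of the subgradient $\gamma$ in (3), we know that $\hat\beta_{-\mathcal{E}} = 0$ for any lasso solution $\hat\beta$. Hence
the $\mathcal{E}$ block of (2) can be written as

$$X_\mathcal{E}^T (y - X_\mathcal{E} \hat\beta_\mathcal{E}) = \lambda s. \qquad (6)$$

This means that $\lambda s \in \mathrm{row}(X_\mathcal{E})$, so $\lambda s = X_\mathcal{E}^T (X_\mathcal{E}^T)^+ \lambda s$. Using this fact, and rearranging (6), we get

$$X_\mathcal{E}^T X_\mathcal{E} \hat\beta_\mathcal{E} = X_\mathcal{E}^T (y - (X_\mathcal{E}^T)^+ \lambda s).$$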
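Therefore the (unique) lasso fit $X\hat\beta = X_\mathcal{E} \hat\beta_\mathcal{E}$ is

$$X\hat\beta = X_\mathcal{E} (X_\mathcal{E})^+ (y - (X_\mathcal{E}^T)^+ \lambda s), \qquad (7)$$

and any lasso solution $\hat\beta$ is of the form

$$\hat\beta_{-\mathcal{E}} = 0 \quad \text{and} \quad \hat\beta_\mathcal{E} = (X_\mathcal{E})^+ (y - (X_\mathcal{E}^T)^+ \lambda s) + b, \qquad (8)$$

where $b \in \mathrm{null}(X_\mathcal{E})$. In particular, any $b \in \mathrm{null}(X_\mathcal{E})$ produces a lasso solution $\hat\beta$ in (8) provided that
$\hat\beta$ has the correct signs over its nonzero coefficients, that is, $\mathrm{sign}(\hat\beta_i) = s_i$ for all $\hat\beta_i \neq 0$. We can
write these conditions together as

$$b \in \mathrm{null}(X_\mathcal{E}) \quad \text{and} \quad s_i \cdot \big( [(X_\mathcal{E})^+ (y - (X_\mathcal{E}^T)^+ \lambda s)]_i + b_i \big) \geq 0 \;\; \text{for } i \in \mathcal{E}, \qquad (9)$$

and hence any $b$ satisfying (9) gives a lasso solution $\hat\beta$ in (8). In the next section, using a sequence
of straightforward arguments, we prove that the lasso solution is unique under somewhat general
conditions.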
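To make (8) and (9) concrete, the following sketch revisits a toy problem with two identical columns, so that $\mathrm{null}(X_\mathcal{E})$ is one-dimensional. The data, the helper name `satisfies_condition_9`, and the explicit `rcond` cutoff are our own assumptions; the equicorrelation signs are hard-coded for this particular example, where both columns have inner product $\lambda$ with the residual.

```python
import numpy as np

# Hypothetical toy problem: identical unit-norm columns, both in the
# equicorrelation set with sign +1.
x = np.array([3.0, 4.0]) / 5.0
X_E = np.column_stack([x, x])
y = np.array([2.0, 1.0])
lam = 0.5
s = np.array([1.0, 1.0])

# Particular solution in (8) with b = 0.
pinv_E = np.linalg.pinv(X_E, rcond=1e-10)       # (X_E)^+
pinv_Et = np.linalg.pinv(X_E.T, rcond=1e-10)    # (X_E^T)^+
beta0 = pinv_E @ (y - pinv_Et @ (lam * s))

def satisfies_condition_9(b, tol=1e-10):
    """Condition (9): b in null(X_E) and s_i * (beta0_i + b_i) >= 0 for all i."""
    b = np.asarray(b, dtype=float)
    in_null = np.linalg.norm(X_E @ b) <= tol
    signs_ok = np.all(s * (beta0 + b) >= -tol)
    return bool(in_null and signs_ok)

print(beta0)                                    # the b = 0 solution
print(satisfies_condition_9([0.75, -0.75]))     # shifts to (1.5, 0)
print(satisfies_condition_9([1.0, -1.0]))       # would give a negative coefficient
```

Shifts along the null-space direction $(1, -1)$ give valid solutions only while both coefficients keep a nonnegative product with their signs, which is exactly the restriction in (9).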

2.2 Sufficient conditions for uniqueness 

From our work in the previous section, we can see that if $\mathrm{null}(X_\mathcal{E}) = \{0\}$, then the lasso solution is
unique and is given by (8) with $b = 0$. (We note that $b = 0$ necessarily satisfies the sign condition
in (9), because a lasso solution is guaranteed to exist by Lemma 1.) Then by rearranging (8), done
to emphasize the rank of $X_\mathcal{E}$, we have the following result.

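Lemma 2. For any $y$, $X$, and $\lambda > 0$, if $\mathrm{null}(X_\mathcal{E}) = \{0\}$, or equivalently if $\mathrm{rank}(X_\mathcal{E}) = |\mathcal{E}|$, then the
lasso solution is unique, and is given by

$$\hat\beta_{-\mathcal{E}} = 0 \quad \text{and} \quad \hat\beta_\mathcal{E} = (X_\mathcal{E}^T X_\mathcal{E})^{-1} (X_\mathcal{E}^T y - \lambda s), \qquad (10)$$

where $\mathcal{E}$ and $s$ are the equicorrelation set and signs as defined in (4) and (5). Note that this solution
has at most $\min\{n, p\}$ nonzero components.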
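Formula (10) can be sanity-checked on a design with orthonormal columns, where the lasso solution is known to be the soft-thresholded vector $X^T y$ (a standard fact that we use here only as a reference; it is not part of the argument above). The sketch below uses made-up data and our own variable names: it recovers $\mathcal{E}$ and $s$ from the residual as in (4) and (5), and then rebuilds the solution from (10).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 8, 4, 0.7
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal columns
y = X @ np.array([3.0, -2.0, 0.2, 0.0])            # so that X^T y = (3, -2, 0.2, 0)

# Reference solution for an orthonormal design: soft-thresholding of X^T y.
z = X.T @ y
beta_ref = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Equicorrelation set (4) and signs (5), from the residual at the solution.
corr = X.T @ (y - X @ beta_ref)
E = np.flatnonzero(np.isclose(np.abs(corr), lam))
s = np.sign(corr[E])

# Formula (10): beta_E = (X_E^T X_E)^{-1} (X_E^T y - lam * s), zeros elsewhere.
X_E = X[:, E]
beta = np.zeros(p)
beta[E] = np.linalg.solve(X_E.T @ X_E, X_E.T @ y - lam * s)

print("E =", E, " s =", s)
print("beta from (10):", beta)
```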

This sufficient condition for uniqueness has appeared many times in the literature. For example, 
see Osborne et al. (2000b), Fuchs (2005), Wainwright (2009), Candes & Plan (2009). We will show
later in Section 5 that the same condition is actually also necessary, for almost every $y \in \mathbb{R}^n$.

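Note that $\mathcal{E}$ depends on the lasso solution at $y$, $X$, $\lambda$, and hence the condition $\mathrm{null}(X_\mathcal{E}) = \{0\}$ is
somewhat circular. There are more natural conditions, depending on $X$ alone, that imply $\mathrm{null}(X_\mathcal{E}) =
\{0\}$. To see this, suppose that $\mathrm{null}(X_\mathcal{E}) \neq \{0\}$; then for some $i \in \mathcal{E}$, we can write

$$X_i = \sum_{j \in \mathcal{E} \setminus \{i\}} c_j X_j,$$

where $c_j \in \mathbb{R}$, $j \in \mathcal{E} \setminus \{i\}$. Hence,

$$s_i X_i = \sum_{j \in \mathcal{E} \setminus \{i\}} (s_i s_j c_j) \cdot (s_j X_j).$$

By definition of the equicorrelation set, $X_j^T r = s_j \lambda$ for any $j \in \mathcal{E}$, where $r = y - X\hat\beta$ is the lasso
residual. Taking the inner product of both sides above with $r$, we get

$$\lambda = \sum_{j \in \mathcal{E} \setminus \{i\}} (s_i s_j c_j) \lambda,$$

or

$$\sum_{j \in \mathcal{E} \setminus \{i\}} (s_i s_j c_j) = 1,$$

assuming that $\lambda > 0$. Therefore, we have shown that if $\mathrm{null}(X_\mathcal{E}) \neq \{0\}$, then for some $i \in \mathcal{E}$,

$$s_i X_i = \sum_{j \in \mathcal{E} \setminus \{i\}} a_j \cdot s_j X_j,$$

with $\sum_{j \in \mathcal{E} \setminus \{i\}} a_j = 1$, which means that $s_i X_i$ lies in the affine span of $s_j X_j$, $j \in \mathcal{E} \setminus \{i\}$. Note that
we can assume without loss of generality that $\mathcal{E} \setminus \{i\}$ has at most $n$ elements, since otherwise we
can simply repeat the above arguments replacing $\mathcal{E}$ by any one of its subsets with $n + 1$ elements;
hence the affine span of $s_j X_j$, $j \in \mathcal{E} \setminus \{i\}$ is at most $n - 1$ dimensional.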

We say that the matrix $X \in \mathbb{R}^{n \times p}$ has columns in general position if any affine subspace $L \subseteq \mathbb{R}^n$ of
dimension $k < n$ contains no more than $k + 1$ elements of the set $\{\pm X_1, \ldots, \pm X_p\}$, excluding
antipodal pairs. Another way of saying this: the affine span of any $k + 1$ points $\sigma_1 X_{i_1}, \ldots, \sigma_{k+1} X_{i_{k+1}}$,
for arbitrary signs $\sigma_1, \ldots, \sigma_{k+1} \in \{-1, 1\}$, does not contain any element of $\{\pm X_i : i \neq i_1, \ldots, i_{k+1}\}$.
From what we have just shown, the predictor matrix $X$ having columns in general position is enough
to ensure uniqueness.
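For small matrices, general position can be checked by brute force. The sketch below is our own reformulation, not a procedure from the paper: the columns are in general position exactly when, for every $m \leq n + 1$, every choice of $m$ distinct columns with arbitrary signs gives affinely independent points. The function name and the rank tolerance are assumptions, and the cost grows exponentially, so this is only meant for illustration.

```python
import itertools
import numpy as np

def columns_in_general_position(X, tol=1e-10):
    """Brute-force general position check for the columns of a small matrix X."""
    n, p = X.shape
    for m in range(2, min(n + 1, p) + 1):
        for cols in itertools.combinations(range(p), m):
            # A global sign flip does not affect affine independence,
            # so the first sign is fixed to +1.
            for rest in itertools.product([1.0, -1.0], repeat=m - 1):
                signs = np.array((1.0,) + rest)
                pts = X[:, list(cols)] * signs         # n x m signed points
                diffs = pts[:, 1:] - pts[:, [0]]       # n x (m - 1) differences
                # m points are affinely independent iff the differences
                # from the first point have rank m - 1.
                if np.linalg.matrix_rank(diffs, tol=tol) < m - 1:
                    return False
    return True

rng = np.random.default_rng(3)
X_cont = rng.standard_normal((3, 4))          # continuous entries
X_dup = X_cont.copy()
X_dup[:, 1] = X_dup[:, 0]                     # duplicated column
X_mid = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5]])           # third column on the line through the first two

print(columns_in_general_position(X_cont))
print(columns_in_general_position(X_dup))
print(columns_in_general_position(X_mid))
```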

Lemma 3. If the columns of $X$ are in general position, then for any $y$ and $\lambda > 0$, the lasso solution
is unique and is given by (10).

This result has also essentially appeared in the literature, taking various forms when stated for 
various related problems. For example, Rosset et al. (2004) give a similar result for general convex 
loss functions. Dossal (2012) gives a related result for the noiseless lasso problem (also called basis 
pursuit). Donoho (2006) gives results tying together the uniqueness (and equality) of solutions of the
noiseless lasso problem and the corresponding $\ell_0$ minimization problem.

Although the definition of general position may seem somewhat technical, this condition is naturally
satisfied when the entries of the predictor matrix $X$ are drawn from a continuous probability
distribution. More precisely, if the entries of $X$ follow a joint distribution that is absolutely continuous
with respect to Lebesgue measure on $\mathbb{R}^{np}$, then the columns of $X$ are in general position
with probability one. To see this, first consider the probability $P(X_{k+2} \in \mathrm{aff}\{X_1, \ldots, X_{k+1}\})$, where
$\mathrm{aff}\{X_1, \ldots, X_{k+1}\}$ denotes the affine span of $X_1, \ldots, X_{k+1}$. Note that, by continuity,

$$P(X_{k+2} \in \mathrm{aff}\{X_1, \ldots, X_{k+1}\} \mid X_1, \ldots, X_{k+1}) = 0,$$

because (for fixed $X_1, \ldots, X_{k+1}$) the set $\mathrm{aff}\{X_1, \ldots, X_{k+1}\} \subseteq \mathbb{R}^n$ has Lebesgue measure zero. Therefore,
integrating over $X_1, \ldots, X_{k+1}$, we get that $P(X_{k+2} \in \mathrm{aff}\{X_1, \ldots, X_{k+1}\}) = 0$. Taking a union
over all subsets of $k + 2$ columns, all combinations of $k + 2$ signs, and all $k < n$, we conclude that
with probability zero the columns are not in general position. This leads us to our final sufficient
condition for uniqueness of the lasso solution.

Lemma 4. If the entries of $X \in \mathbb{R}^{n \times p}$ are drawn from a continuous probability distribution on $\mathbb{R}^{np}$,
then for any $y$ and $\lambda > 0$, the lasso solution is unique and is given by (10) with probability one.

According to this result, we essentially never have to worry about uniqueness when the predictor
variables come from a continuous distribution, regardless of the sizes of $n$ and $p$. Actually, there
is nothing really special about $\ell_1$ penalized linear regression in particular: we show next that the
same uniqueness result holds for $\ell_1$ penalized minimization with any differentiable, strictly convex
loss function.
2.3 General convex loss functions



We consider the more general minimization problem 

$$\hat\beta \in \mathrm{argmin}_{\beta \in \mathbb{R}^p} \; f(X\beta) + \lambda \|\beta\|_1, \qquad (11)$$

where the loss function $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable and strictly convex. To be clear, we mean that
$f$ is strictly convex in its argument, so for example the function $f(u) = \|y - u\|_2^2$ is strictly convex,
even though $f(X\beta) = \|y - X\beta\|_2^2$ may not be strictly convex in $\beta$.

The main ideas from Section 2.1 carry over to this more general problem. The arguments given
in the proof of Lemma 1 can be applied (relying on the strict convexity of $f$) to show that the same
set of basic results hold for problem (11): (i) there is either a unique solution or uncountably many
solutions (see footnote 1); (ii) every solution $\hat\beta$ gives the same fit $X\hat\beta$; (iii) if $\lambda > 0$, then every solution $\hat\beta$ has the
same $\ell_1$ norm. The KKT conditions for (11) can be expressed as

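$$X^T (-\nabla f)(X\hat\beta) = \lambda \gamma, \qquad (12)$$

$$\gamma_i \in \begin{cases} \{\mathrm{sign}(\hat\beta_i)\} & \text{if } \hat\beta_i \neq 0 \\ [-1, 1] & \text{if } \hat\beta_i = 0 \end{cases}, \quad \text{for } i = 1, \ldots, p, \qquad (13)$$

where $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$ is the gradient of $f$, and we can define the equicorrelation set and signs in the
same way as before,

$$\mathcal{E} = \{ i \in \{1, \ldots, p\} : |X_i^T (-\nabla f)(X\hat\beta)| = \lambda \},$$

and

$$s = \mathrm{sign}(X_\mathcal{E}^T (-\nabla f)(X\hat\beta)).$$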

The subgradient condition (13) implies that $\hat\beta_{-\mathcal{E}} = 0$ for any solution $\hat\beta$ in (11). For squared error
loss, recall that we then explicitly solved for $\hat\beta_\mathcal{E}$ as a function of $\mathcal{E}$ and $s$. This is not possible for a
general loss function $f$; but given $\mathcal{E}$ and $s$, we can rewrite the minimization problem (11) over the
coordinates in $\mathcal{E}$ as

$$\hat\beta_\mathcal{E} \in \mathrm{argmin}_{\beta_\mathcal{E} \in \mathbb{R}^{|\mathcal{E}|}} \; f(X_\mathcal{E} \beta_\mathcal{E}) + \lambda \|\beta_\mathcal{E}\|_1. \qquad (14)$$

Now, if $\mathrm{null}(X_\mathcal{E}) = \{0\}$ (equivalently $\mathrm{rank}(X_\mathcal{E}) = |\mathcal{E}|$), then the criterion in (14) is strictly convex,
as $f$ itself is strictly convex. This implies that there is a unique solution in (14), and therefore a
unique solution $\hat\beta$ in (11). Hence, we arrive at the same conclusions as those made in Section 2.2,
that there is a unique solution in (11) if the columns of $X$ are in general position, and ultimately,
the following result.

Lemma 5. If $X \in \mathbb{R}^{n \times p}$ has entries drawn from a continuous probability distribution on $\mathbb{R}^{np}$, then
for any differentiable, strictly convex function $f$, and for any $\lambda > 0$, the minimization problem (11)
has a unique solution with probability one. This solution has at most $\min\{n, p\}$ nonzero components.

This general result applies to any differentiable, strictly convex loss function $f$, which is quite a
broad class. For example, it applies to logistic regression loss,

$$f(u) = \sum_{i=1}^n \left[ -y_i u_i + \log(1 + \exp(u_i)) \right],$$

where typically (but not necessarily) each $y_i \in \{0, 1\}$, and Poisson regression loss,

$$f(u) = \sum_{i=1}^n \left[ -y_i u_i + \exp(u_i) \right],$$

where typically (but again, not necessarily) each $y_i \in \mathbb{N} = \{0, 1, 2, \ldots\}$.

1 To be precise, if $\lambda = 0$ then problem (11) may not have a solution for an arbitrary differentiable, strictly convex
function $f$. This is because $f$ may have directions of recession that are not directions in which $f$ is constant, and
hence it may not attain its minimal value. For example, the function $f(u) = e^{-u}$ is differentiable and strictly convex
on $\mathbb{R}$, but does not attain its minimum. Therefore, for $\lambda = 0$, the statements in this section should all be interpreted
as conditional on the existence of a solution in the first place. For $\lambda > 0$, the $\ell_1$ penalty gets rid of this issue, as the
criterion in (11) has no directions of recession, implying the existence of a solution.
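The logistic and Poisson losses of this section, together with the gradients that enter the KKT condition (12), can be written down in a few lines. This is a minimal sketch; the function names are our own, and the responses `y01` and `ycount` in the usage lines are made-up data.

```python
import numpy as np

def logistic_loss(u, y):
    """f(u) = sum_i [-y_i u_i + log(1 + exp(u_i))]."""
    return np.sum(-y * u + np.logaddexp(0.0, u))

def logistic_grad(u, y):
    """Gradient of the logistic loss with respect to u."""
    return -y + 1.0 / (1.0 + np.exp(-u))

def poisson_loss(u, y):
    """f(u) = sum_i [-y_i u_i + exp(u_i)]."""
    return np.sum(-y * u + np.exp(u))

def poisson_grad(u, y):
    """Gradient of the Poisson loss with respect to u."""
    return -y + np.exp(u)

# The left-hand side of (12) for a candidate beta is X^T (-grad f)(X beta).
rng = np.random.default_rng(4)
X = rng.standard_normal((5, 3))
beta = np.array([0.5, 0.0, -1.0])
y01 = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
ycount = np.array([0.0, 2.0, 1.0, 4.0, 0.0])
print(X.T @ (-logistic_grad(X @ beta, y01)))
print(X.T @ (-poisson_grad(X @ beta, ycount)))
```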

We shift our focus in the next section, and without assuming any conditions for uniqueness, we 
show how to compute a solution path for the lasso problem (over the regularization parameter $\lambda$).

3 The LARS algorithm for the lasso path 

The LARS algorithm is a great tool for understanding the behavior of lasso solutions. (To be clear, 
here and throughout we use the term "LARS algorithm" to refer to the version of the algorithm 
that computes the lasso solution path, and not the version that performs a special kind of forward 
variable selection.) The algorithm begins at $\lambda = \infty$, where the lasso solution is trivially $0 \in \mathbb{R}^p$.
Then, as the parameter $\lambda$ decreases, it computes a solution path $\hat\beta^{\mathrm{LARS}}(\lambda)$ that is piecewise linear and
continuous as a function of $\lambda$. Each knot in this path corresponds to an iteration of the algorithm,
in which the path's linear trajectory is altered in order to satisfy the KKT optimality conditions.
The LARS algorithm was proposed (and named) by Efron et al. (2004), though essentially the same
idea appeared earlier in the works of Osborne et al. (2000a) and Osborne et al. (2000b). It is worth
noting that the LARS algorithm (as proposed in any of these works) assumes that $\mathrm{rank}(X_\mathcal{E}) = |\mathcal{E}|$
throughout the lasso path. This is not necessarily correct when $\mathrm{rank}(X) < p$, and can lead to errors
in computing lasso solutions. (However, from what we showed in Section 2, this "naive" assumption
is indeed correct with probability one when the predictors are drawn from a continuous distribution, 
and this is likely the reason why such a small oversight has gone unnoticed since the time of the 
original publications.) 

In this section, we extend the LARS algorithm to cover a generic predictor matrix $X$. Though
the lasso solution is not necessarily unique in this general case, and we may have $\mathrm{rank}(X_\mathcal{E}) < |\mathcal{E}|$
at some points along the path, we show that a piecewise linear and continuous path of solutions still
exists, and computing this path requires only a simple modification to the previously proposed LARS
algorithm. We describe the algorithm and its steps in detail, but delay the proof of its correctness
until Appendix A.1. We also present a few properties of this algorithm and the solutions along its
path.

3.1 Description of the LARS algorithm 

We start with an overview of the LARS algorithm to compute the lasso path (extended to cover an 
arbitrary predictor matrix X), and then we describe its steps in detail at a general iteration k. The 
algorithm presented here is of course very similar to the original LARS algorithm of Efron et al. 
(2004). The key difference is the following: if $X_\mathcal{E}^T X_\mathcal{E}$ is singular, then the KKT conditions over the
variables in $\mathcal{E}$ no longer have a unique solution, and the current algorithm uses the solution with the
minimum $\ell_2$ norm, as in (15) and (16). This seemingly minor detail is the basis for the algorithm's
correctness in the general $X$ case.

Algorithm 1 (The LARS algorithm for the lasso path).

Given $y$ and $X$.

• Start with the iteration counter $k = 0$, regularization parameter $\lambda_0 = \infty$, equicorrelation set
$\mathcal{E} = \emptyset$, and equicorrelation signs $s = \emptyset$.

• While $\lambda_k > 0$:

1. Compute the LARS lasso solution at $\lambda_k$ by least squares, as in (15) and (16). Continue
in a linear direction from the solution for $\lambda \leq \lambda_k$.

2. Compute the next joining time $\lambda_{k+1}^{\mathrm{join}}$, when a variable outside the equicorrelation set
achieves the maximal absolute inner product with the residual, as in (17) and (18).

3. Compute the next crossing time $\lambda_{k+1}^{\mathrm{cross}}$, when the coefficient path of an equicorrelation
variable crosses through zero, as in (19) and (20).

4. Set $\lambda_{k+1} = \max\{\lambda_{k+1}^{\mathrm{join}}, \lambda_{k+1}^{\mathrm{cross}}\}$. If $\lambda_{k+1}^{\mathrm{join}} > \lambda_{k+1}^{\mathrm{cross}}$, then add the joining variable to $\mathcal{E}$ and
its sign to $s$; otherwise, remove the crossing variable from $\mathcal{E}$ and its sign from $s$. Update
$k = k + 1$.

2 The description of this algorithm and its proof of correctness previously appeared in Appendix B of the author's
doctoral dissertation (Tibshirani 2011).

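At the start of the $k$th iteration, the regularization parameter is $\lambda = \lambda_k$. For the path's solution
at $\lambda_k$, we set the non-equicorrelation coefficients equal to zero, $\hat\beta_{-\mathcal{E}}^{\mathrm{LARS}}(\lambda_k) = 0$, and we compute the
equicorrelation coefficients as

$$\hat\beta_\mathcal{E}^{\mathrm{LARS}}(\lambda_k) = (X_\mathcal{E})^+ (y - (X_\mathcal{E}^T)^+ \lambda_k s) = c - \lambda_k d, \qquad (15)$$

where $c = (X_\mathcal{E})^+ y$ and $d = (X_\mathcal{E})^+ (X_\mathcal{E}^T)^+ s = (X_\mathcal{E}^T X_\mathcal{E})^+ s$ are defined to help emphasize that this is
a linear function of the regularization parameter. This estimate can be viewed as the minimum $\ell_2$
norm solution of a least squares problem on the variables in $\mathcal{E}$ (in which we consider $\mathcal{E}$, $s$ as fixed):

$$\hat\beta_\mathcal{E}^{\mathrm{LARS}}(\lambda_k) = \mathrm{argmin} \Big\{ \|\hat\beta_\mathcal{E}\|_2 : \hat\beta_\mathcal{E} \in \mathrm{argmin}_{\beta_\mathcal{E} \in \mathbb{R}^{|\mathcal{E}|}} \|y - (X_\mathcal{E}^T)^+ \lambda_k s - X_\mathcal{E} \beta_\mathcal{E}\|_2^2 \Big\}. \qquad (16)$$

Now we decrease $\lambda$, keeping $\hat\beta_{-\mathcal{E}}^{\mathrm{LARS}}(\lambda) = 0$, and letting

$$\hat\beta_\mathcal{E}^{\mathrm{LARS}}(\lambda) = c - \lambda d,$$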

that is, moving in the linear direction suggested by (15). As $\lambda$ decreases, we make two important
checks. First, we check when (that is, we compute the value of $\lambda$ at which) a variable outside the
equicorrelation set $\mathcal{E}$ should join $\mathcal{E}$ because it attains the maximal absolute inner product with the
residual; we call this the next joining time $\lambda_{k+1}^{\mathrm{join}}$. Second, we check when a variable in $\mathcal{E}$ will have
a coefficient path crossing through zero; we call this the next crossing time $\lambda_{k+1}^{\mathrm{cross}}$.
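The quantities $c$ and $d$ from (15) are simple to compute with a pseudoinverse. The sketch below is our own illustration (the function name `lars_direction`, the `rcond` cutoff, and the toy data are assumptions); it uses a predictor matrix with a duplicated column, so that $X_\mathcal{E}^T X_\mathcal{E}$ is singular and the minimum $\ell_2$ norm choice matters.

```python
import numpy as np

def lars_direction(X, y, E, s, rcond=1e-10):
    """Return c = (X_E)^+ y and d = (X_E^T X_E)^+ s from (15), so that the
    equicorrelation coefficients at lam are c - lam * d."""
    X_E = X[:, np.asarray(E, dtype=int)]
    c = np.linalg.pinv(X_E, rcond=rcond) @ y
    d = np.linalg.pinv(X_E.T @ X_E, rcond=rcond) @ s
    return c, d

# Hypothetical toy data: columns 0 and 1 are identical, column 2 is orthogonal.
x = np.array([0.6, 0.8, 0.0])
z = np.array([0.0, 0.0, 1.0])
X = np.column_stack([x, x, z])
y = np.array([2.0, 1.0, 0.3])
E, s = [0, 1], np.array([1.0, 1.0])

c, d = lars_direction(X, y, E, s)

def beta_E(lam):
    """Equicorrelation coefficients c - lam * d along the current linear piece."""
    return c - lam * d

print("c =", c, " d =", d)
print("beta_E at lam = 1:", beta_E(1.0))
```

With the duplicated column, the pseudoinverse splits the coefficient equally between the two copies, which is the minimum $\ell_2$ norm solution described in (16).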
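For the first check, for each $i \notin \mathcal{E}$, we solve the equation

$$X_i^T (y - X_\mathcal{E} (c - \lambda d)) = \pm \lambda.$$

A simple calculation shows that the solution is

$$t_i^{\mathrm{join}} = \frac{X_i^T (X_\mathcal{E} c - y)}{X_i^T X_\mathcal{E} d \pm 1} = \frac{X_i^T (X_\mathcal{E} (X_\mathcal{E})^+ - I) y}{X_i^T (X_\mathcal{E}^T)^+ s \pm 1}, \qquad (17)$$

called the joining time of the $i$th variable. (Although the notation is ambiguous, the quantity $t_i^{\mathrm{join}}$ is
uniquely defined, as only one of $+1$ or $-1$ above will yield a value in $[0, \lambda_k]$.) Hence the next joining
time is

$$\lambda_{k+1}^{\mathrm{join}} = \max_{i \notin \mathcal{E}} t_i^{\mathrm{join}}, \qquad (18)$$

and the joining coordinate and its sign are

$$i_{k+1}^{\mathrm{join}} = \mathrm{argmax}_{i \notin \mathcal{E}} \; t_i^{\mathrm{join}} \quad \text{and} \quad s_{k+1}^{\mathrm{join}} = \mathrm{sign}\Big( X_{i_{k+1}^{\mathrm{join}}}^T \big\{ y - X \hat\beta^{\mathrm{LARS}}(\lambda_{k+1}^{\mathrm{join}}) \big\} \Big).$$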
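The joining times in (17) and (18) can be sketched as follows. This is an illustration under our own conventions, not the paper's implementation: the function name, the tolerance, the choice to record a time of 0 when neither sign yields a value in $[0, \lambda_k]$, and the identity-design example are all assumptions.

```python
import numpy as np

def joining_times(X, y, E, c, d, lam_k, tol=1e-12):
    """Joining times (17) for variables outside E, and the maximizer in (18).

    c, d are the quantities from (15) for the current set E.
    Returns (times, next_time, next_index); times[i] is 0 for i in E.
    """
    E = list(E)
    p = X.shape[1]
    X_E = X[:, np.asarray(E, dtype=int)]
    a = X.T @ (X_E @ c - y)          # numerators X_i^T (X_E c - y)
    b = X.T @ (X_E @ d)              # X_i^T X_E d
    times = np.zeros(p)
    outside = [i for i in range(p) if i not in E]
    for i in outside:
        for sign in (1.0, -1.0):
            denom = b[i] + sign
            if abs(denom) <= tol:
                continue
            t = a[i] / denom
            if -tol <= t <= lam_k + tol:      # keep the candidate in [0, lam_k]
                times[i] = max(times[i], t)
    if not outside:
        return times, 0.0, None
    next_index = max(outside, key=lambda i: times[i])
    return times, times[next_index], next_index

# Hypothetical example with X = I, where variable i joins at |y_i|.
X = np.eye(3)
y = np.array([3.0, -2.0, 1.0])
E, s, lam_k = [0], np.array([1.0]), 3.0
X_E = X[:, E]
c = np.linalg.pinv(X_E) @ y
d = np.linalg.pinv(X_E.T @ X_E) @ s

times, lam_join, i_join = joining_times(X, y, E, c, d, lam_k)
beta = np.zeros(3)
beta[E] = c - lam_join * d
s_join = np.sign(X[:, i_join] @ (y - X @ beta))
print(times, lam_join, i_join, s_join)
```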
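As for the second check, note that a variable $i \in \mathcal{E}$ will have a zero coefficient when $\lambda = c_i / d_i =
[(X_\mathcal{E})^+ y]_i / [(X_\mathcal{E}^T X_\mathcal{E})^+ s]_i$. Because we are only considering $\lambda \leq \lambda_k$, we define the crossing time of the
$i$th variable as

$$t_i^{\mathrm{cross}} = \begin{cases} \dfrac{[(X_\mathcal{E})^+ y]_i}{[(X_\mathcal{E}^T X_\mathcal{E})^+ s]_i} & \text{if } \dfrac{[(X_\mathcal{E})^+ y]_i}{[(X_\mathcal{E}^T X_\mathcal{E})^+ s]_i} \in [0, \lambda_k] \\ 0 & \text{otherwise.} \end{cases} \qquad (19)$$

The next crossing time is therefore

$$\lambda_{k+1}^{\mathrm{cross}} = \max_{i \in \mathcal{E}} t_i^{\mathrm{cross}}, \qquad (20)$$

and the crossing coordinate and its sign are

$$i_{k+1}^{\mathrm{cross}} = \mathrm{argmax}_{i \in \mathcal{E}} \; t_i^{\mathrm{cross}} \quad \text{and} \quad s_{k+1}^{\mathrm{cross}} = s_{i_{k+1}^{\mathrm{cross}}}.$$

Finally, we decrease $\lambda$ until the next joining time or crossing time, whichever happens first, by
setting $\lambda_{k+1} = \max\{\lambda_{k+1}^{\mathrm{join}}, \lambda_{k+1}^{\mathrm{cross}}\}$. If $\lambda_{k+1}^{\mathrm{join}} > \lambda_{k+1}^{\mathrm{cross}}$, then we add the joining coordinate $i_{k+1}^{\mathrm{join}}$ to $\mathcal{E}$
and its sign $s_{k+1}^{\mathrm{join}}$ to $s$. Otherwise, we delete the crossing coordinate $i_{k+1}^{\mathrm{cross}}$ from $\mathcal{E}$ and its sign $s_{k+1}^{\mathrm{cross}}$
from $s$.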
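The crossing times (19) and (20) depend only on $c$, $d$, and $\lambda_k$. A minimal sketch follows; the function name, the tolerance, and the numbers in the example are our own, and entries with $d_i = 0$ (coefficients that do not move) are assigned a crossing time of 0. The sketch does not address how a full path implementation should treat a ratio that sits exactly at the endpoint $\lambda_k$.

```python
import numpy as np

def crossing_times(c, d, lam_k, tol=1e-12):
    """Crossing times (19): c_i / d_i when that ratio lies in [0, lam_k], else 0."""
    c = np.asarray(c, dtype=float)
    d = np.asarray(d, dtype=float)
    times = np.zeros_like(c)
    moving = np.abs(d) > tol                      # coefficients that change with lam
    ratio = c[moving] / d[moving]
    valid = (ratio >= -tol) & (ratio <= lam_k + tol)
    times[moving] = np.where(valid, ratio, 0.0)
    return times

# Hypothetical values of c and d for three equicorrelation variables.
c = np.array([1.0, -2.0, 0.5])
d = np.array([0.5, 1.0, 0.0])
lam_k = 3.0

t_cross = crossing_times(c, d, lam_k)
lam_cross = t_cross.max()                         # next crossing time (20)
i_cross = int(np.argmax(t_cross))                 # position within E of the crossing variable
print(t_cross, lam_cross, i_cross)
```

In this example the first coefficient, $1 - 0.5\lambda$, reaches zero at $\lambda = 2$, while the second would only vanish at a negative value of $\lambda$ and the third does not move at all.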

The proof of correctness for this algorithm shows that the computed path $\hat\beta^{\mathrm{LARS}}(\lambda)$ satisfies the
KKT conditions (2) and (3) at each $\lambda$, and is hence indeed a lasso solution path. It also shows that
the computed path is continuous at each knot $\lambda_k$ in the path, and hence is globally continuous in
$\lambda$. The fact that $X_\mathcal{E}^T X_\mathcal{E}$ can be singular makes the proof somewhat complicated (at least more so
than it is for the case $\mathrm{rank}(X) = p$), and hence we delay its presentation until Appendix A.1.



3.2 Properties of the LARS algorithm and its solutions 

Two basic properties of the LARS lasso path, as mentioned in the previous section, are piecewise
linearity and continuity with respect to $\lambda$. The algorithm and the solutions along its computed path
possess a few other nice properties, most of them discussed in this section, and some others later in
Section 5. We begin with a property of the LARS algorithm itself.

Lemma 6. For any $y$, $X$, the LARS algorithm for the lasso path performs at most

$$\sum_{k=0}^{p} \binom{p}{k} 2^k = 3^p$$

iterations before termination.

Proof. The idea behind the proof is quite simple, and was first noticed by Osborne et al. (2000a) for
their homotopy algorithm: any given pair of equicorrelation set $\mathcal{E}$ and sign vector $s$ that appear in
one iteration of the algorithm cannot be revisited in a future iteration, due to the linear nature of
the solution path. To elaborate, suppose that $\mathcal{E}$, $s$ were the equicorrelation set and signs at iteration
$k$ and also at iteration $k'$, with $k' > k$. Then this would imply that the constraints

$$|X_i^T (y - X_\mathcal{E} \hat\beta_\mathcal{E}^{\mathrm{LARS}}(\lambda))| \leq \lambda \quad \text{for each } i \notin \mathcal{E}, \qquad (21)$$

$$s_i \cdot [\hat\beta_\mathcal{E}^{\mathrm{LARS}}(\lambda)]_i \geq 0 \quad \text{for each } i \in \mathcal{E}, \qquad (22)$$

hold at both $\lambda = \lambda_k$ and $\lambda = \lambda_{k'}$. But $\hat\beta_\mathcal{E}^{\mathrm{LARS}}(\lambda) = c - \lambda d$ is a linear function of $\lambda$, and this implies
that (21) and (22) also hold at every $\lambda \in [\lambda_{k'}, \lambda_k]$, contradicting the fact that $k'$ and $k$ are distinct
iterations. Therefore the total number of iterations performed by the LARS algorithm is bounded
by the number of distinct pairs of subsets $\mathcal{E} \subseteq \{1, \ldots, p\}$ and sign vectors $s \in \{-1, 1\}^{|\mathcal{E}|}$. □
Remark. Mairal & Yu (2012) showed recently that the upper bound for the number of steps taken
by the original LARS algorithm, which assumes that $\operatorname{rank}(X_\mathcal{E}) = |\mathcal{E}|$ throughout the path, can
actually be improved to $(3^p + 1)/2$. Their proof is based on the following observation: if $\mathcal{E}, s$ are
the equicorrelation set and signs at one iteration of the algorithm, then $\mathcal{E}, -s$ cannot appear as
the equicorrelation set and signs in a future iteration. Indeed, this same observation is true for
the extended version of LARS presented here, by essentially the same arguments. Hence the upper
bound in Lemma 6 can also be improved to $(3^p + 1)/2$. Interestingly, Mairal & Yu (2012) further
show that this upper bound is tight: they construct, for any $p$, a problem instance ($y$ and $X$) for
which the LARS algorithm takes exactly $(3^p + 1)/2$ steps.
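The two counts appearing in these bounds can be checked by brute force for small $p$. The sketch below only verifies the arithmetic (the number of pairs of a subset and a sign vector, and the number of such pairs once a sign vector and its negation are identified); it is not a proof of either bound, and the function names are hypothetical.

```python
from itertools import combinations, product

def count_pairs(p):
    """Count all pairs (E, s) with E a subset of {1..p} and s in {-1, 1}^|E|."""
    total = 0
    for k in range(p + 1):
        for E in combinations(range(p), k):
            for s in product((-1, 1), repeat=k):
                total += 1
    return total

def count_pairs_up_to_sign(p):
    """Count pairs when (E, s) and (E, -s) are treated as the same pair."""
    seen = set()
    for k in range(p + 1):
        for E in combinations(range(p), k):
            for s in product((-1, 1), repeat=k):
                neg = tuple(-v for v in s)
                seen.add((E, min(s, neg)))  # canonical representative of {s, -s}
    return len(seen)

for p in range(1, 5):
    print(p, count_pairs(p), 3 ** p, count_pairs_up_to_sign(p), (3 ** p + 1) // 2)
```

The empty set contributes a single pair that is its own negation, which is where the "+1" in $(3^p + 1)/2$ comes from.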

Next, we show that the end of the LARS lasso solution path ($\lambda = 0$) is itself an interesting least
squares solution.

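Lemma 7. For any $y, X$, the LARS lasso solution converges to a minimum $\ell_1$ norm least squares
solution as $\lambda \to 0^+$, that is,

$$\lim_{\lambda \to 0^+} \hat\beta^{\mathrm{LARS}}(\lambda) = \hat\beta^{\mathrm{LS},\ell_1},$$

where $\hat\beta^{\mathrm{LS},\ell_1} \in \operatorname{argmin}_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2$ and achieves the minimum $\ell_1$ norm over all such solutions.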

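Proof. First note that by Lemma 6, the algorithm always takes a finite number of iterations before
terminating, so the limit here is always attained by the algorithm (at its last iteration). Therefore we
can write $\hat\beta^{\mathrm{LARS}}(0) = \lim_{\lambda \to 0^+} \hat\beta^{\mathrm{LARS}}(\lambda)$. Now, by construction, the LARS lasso solution satisfies

$$\bigl| X_i^T \bigl( y - X \hat\beta^{\mathrm{LARS}}(\lambda) \bigr) \bigr| \le \lambda \quad \text{for each } i = 1, \ldots p,$$

at each $\lambda \in [0, \infty]$. Hence at $\lambda = 0$ we have

$$X_i^T \bigl( y - X \hat\beta^{\mathrm{LARS}}(0) \bigr) = 0 \quad \text{for each } i = 1, \ldots p,$$

implying that $\hat\beta^{\mathrm{LARS}}(0)$ is a least squares solution. Suppose that there exists another least squares
solution $\hat\beta^{\mathrm{LS}}$ with $\|\hat\beta^{\mathrm{LS}}\|_1 < \|\hat\beta^{\mathrm{LARS}}(0)\|_1$. Then by continuity of the LARS lasso solution path,
there exists some $\lambda > 0$ such that still $\|\hat\beta^{\mathrm{LS}}\|_1 < \|\hat\beta^{\mathrm{LARS}}(\lambda)\|_1$, so that

$$\frac{1}{2} \|y - X \hat\beta^{\mathrm{LS}}\|_2^2 + \lambda \|\hat\beta^{\mathrm{LS}}\|_1 < \frac{1}{2} \|y - X \hat\beta^{\mathrm{LARS}}(\lambda)\|_2^2 + \lambda \|\hat\beta^{\mathrm{LARS}}(\lambda)\|_1.$$

This contradicts the fact that $\hat\beta^{\mathrm{LARS}}(\lambda)$ is a lasso solution at $\lambda$, and therefore $\hat\beta^{\mathrm{LARS}}(0)$ achieves the
minimum $\ell_1$ norm over all least squares solutions. □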

We showed in Section 3.1 that the LARS algorithm constructs the lasso solution

$$\hat\beta_{-\mathcal{E}}^{\mathrm{LARS}}(\lambda) = 0 \quad \text{and} \quad \hat\beta_\mathcal{E}^{\mathrm{LARS}}(\lambda) = (X_\mathcal{E})^+ \bigl( y - (X_\mathcal{E}^T)^+ \lambda s \bigr),$$

by decreasing $\lambda$ from $\infty$, and continually checking whether it needs to include or exclude variables
from the equicorrelation set $\mathcal{E}$. Recall our previous description (8) of the set of lasso solutions at
any given $\lambda$. In (8), different lasso solutions are formed by choosing different vectors $b$ that satisfy
the two conditions given in (9): a null space condition, $b \in \operatorname{null}(X_\mathcal{E})$, and a sign condition,

$$s_i \cdot \Bigl( \bigl[ (X_\mathcal{E})^+ \bigl( y - (X_\mathcal{E}^T)^+ \lambda s \bigr) \bigr]_i + b_i \Bigr) \ge 0 \quad \text{for } i \in \mathcal{E}.$$

We see that the LARS lasso solution corresponds to the choice $b = 0$. When $\operatorname{rank}(X_\mathcal{E}) = |\mathcal{E}|$, $b = 0$
is the only vector in $\operatorname{null}(X_\mathcal{E})$, so it satisfies the above sign condition by necessity (as we know that
a lasso solution must exist, by Lemma 1). On the other hand, when $\operatorname{rank}(X_\mathcal{E}) < |\mathcal{E}|$, it is certainly true
that $0 \in \operatorname{null}(X_\mathcal{E})$, but it is not at all obvious that the sign condition is satisfied by $b = 0$. The
LARS algorithm establishes this fact by constructing an entire lasso solution path with exactly this
property ($b = 0$) over $\lambda \in [0, \infty]$. At the risk of sounding repetitious, we state this result next in the
form of a lemma.
Lemma 8. For any $y, X$, and $\lambda > 0$, a lasso solution is given by

$$\hat\beta_{-\mathcal{E}}^{\mathrm{LARS}} = 0 \quad \text{and} \quad \hat\beta_\mathcal{E}^{\mathrm{LARS}} = (X_\mathcal{E})^+ \bigl( y - (X_\mathcal{E}^T)^+ \lambda s \bigr), \tag{23}$$

and this is the solution computed by the LARS lasso path algorithm.

For one, this lemma is perhaps interesting from a computational point of view: it says that for
any $y, X$, and $\lambda > 0$, a lasso solution (indeed, the LARS lasso solution) can be computed directly
from $\mathcal{E}$ and $s$, which themselves can be computed from the unique lasso fit. Further, for any $y, X$,
we can start with a lasso solution at $\lambda > 0$ and compute a local solution path using the same LARS
steps; see Appendix A.2 for more details. Aside from computational interests, the explicit form of
a lasso solution given by Lemma 8 may be helpful for the purposes of mathematical analysis; for
example, this form is used by Tibshirani & Taylor (2012) to give a simpler proof of the degrees of
freedom of the lasso fit, for a general $X$, in terms of the equicorrelation set. As another example,
it is also used in Section 5 to prove a necessary condition for the uniqueness of the lasso solution
(holding almost everywhere in $y$).
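The computational reading of Lemma 8 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `lars_lasso_solution`, the tolerance `tol` used to detect the equicorrelation set numerically, and the toy data (two identical unit-norm columns, with one lasso solution written down by hand) are all hypothetical choices.

```python
import numpy as np

def lars_lasso_solution(X, y, lam, beta_any, tol=1e-8):
    """Map any lasso solution at (y, X, lam) to the LARS lasso solution, as in (23)."""
    r = y - X @ beta_any                            # residual of the unique lasso fit
    corr = X.T @ r
    E = np.flatnonzero(np.abs(corr) >= lam - tol)   # equicorrelation set
    s = np.sign(corr[E])                            # equicorrelation signs
    XE_pinv = np.linalg.pinv(X[:, E])               # (X_E)^+
    beta = np.zeros(X.shape[1])
    # (X_E^T)^+ is the transpose of (X_E)^+
    beta[E] = XE_pinv @ (y - XE_pinv.T @ (lam * s))
    return beta, E, s

# Toy instance: two identical unit-norm columns, so the lasso solution is not unique.
x = np.array([0.6, 0.8])
X = np.column_stack([x, x])
y = np.array([3.0, 4.0])                  # x @ y = 5
lam = 1.0
beta_any = np.array([4.0, 0.0])           # one solution: total weight 5 - lam = 4
beta_lars, E, s = lars_lasso_solution(X, y, lam, beta_any)
print(beta_lars, E, s)
```

In this toy case the formula spreads the total weight evenly over the two duplicate columns, while the fit and the $\ell_1$ norm agree with those of the hand-written solution.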

We show in Section 5 that, for almost every $y \in \mathbb{R}^n$, the LARS lasso solution is supported on all
of $\mathcal{E}$ and hence has the largest support of any lasso solution (at the same $y, X, \lambda$). As lasso solutions
all have the same $\ell_1$ norm, by Lemma 1, this means that the LARS lasso solution spreads out the
common $\ell_1$ norm over the largest number of coefficients. It may not be surprising, then, that the
LARS lasso solution has the smallest $\ell_2$ norm among lasso solutions, shown next.

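Lemma 9. For any $y, X$, and $\lambda > 0$, the LARS lasso solution $\hat\beta^{\mathrm{LARS}}$ has the minimum $\ell_2$ norm
over all lasso solutions.

Proof. From (8), we can see that any lasso solution has squared $\ell_2$ norm

$$\|\hat\beta\|_2^2 = \bigl\| (X_\mathcal{E})^+ \bigl( y - (X_\mathcal{E}^T)^+ \lambda s \bigr) \bigr\|_2^2 + \|b\|_2^2,$$

since $b \in \operatorname{null}(X_\mathcal{E})$. Hence $\|\hat\beta\|_2^2 \ge \|\hat\beta^{\mathrm{LARS}}\|_2^2$, with equality if and only if $b = 0$. □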
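The norm identity in this proof rests on an orthogonality fact that may be worth spelling out: the range of a pseudoinverse $(X_\mathcal{E})^+$ is the row space of $X_\mathcal{E}$, which is the orthogonal complement of $\operatorname{null}(X_\mathcal{E})$. Writing $u$ for the LARS part of the solution (the symbol $u$ is introduced here only for this derivation), the cross term vanishes:

```latex
\hat\beta_{\mathcal{E}}
  = \underbrace{(X_{\mathcal{E}})^+ \bigl( y - (X_{\mathcal{E}}^T)^+ \lambda s \bigr)}_{u \,\in\, \mathrm{row}(X_{\mathcal{E}})}
  + \underbrace{b}_{\in\, \mathrm{null}(X_{\mathcal{E}})},
\qquad u^T b = 0,
\\[4pt]
\|\hat\beta\|_2^2 = \|u + b\|_2^2
  = \|u\|_2^2 + 2\, u^T b + \|b\|_2^2
  = \|u\|_2^2 + \|b\|_2^2 .
```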

Mixing together the $\ell_1$ and $\ell_2$ norms brings to mind the elastic net (Zou & Hastie 2005), which
penalizes both the $\ell_1$ norm and the squared $\ell_2$ norm of the coefficient vector. The elastic net utilizes
two tuning parameters $\lambda_1, \lambda_2 \ge 0$ (this notation should not be confused with the knots in the
LARS lasso path), and solves the criterion$^3$

$$\hat\beta^{\mathrm{EN}} \in \operatorname{argmin}_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \frac{\lambda_2}{2} \|\beta\|_2^2. \tag{24}$$

For any $\lambda_2 > 0$, the elastic net solution $\hat\beta^{\mathrm{EN}} = \hat\beta^{\mathrm{EN}}(\lambda_1, \lambda_2)$ is unique, since the criterion is strictly
convex.

Note that if $\lambda_2 = 0$, then (24) is just the lasso problem. On the other hand, if $\lambda_1 = 0$, then (24)
reduces to ridge regression. It is well-known that the ridge regression solution $\hat\beta^{\mathrm{ridge}}(\lambda_2) = \hat\beta^{\mathrm{EN}}(0, \lambda_2)$
converges to the minimum $\ell_2$ norm least squares solution as $\lambda_2 \to 0^+$. Our next result is analogous
to this fact: it says that for any fixed $\lambda_1 > 0$, the elastic net solution converges to the minimum $\ell_2$
norm lasso solution (that is, the LARS lasso solution) as $\lambda_2 \to 0^+$.

Lemma 10. Fix any $X$ and $\lambda_1 > 0$. For almost every $y \in \mathbb{R}^n$, the elastic net solution converges to
the LARS lasso solution as $\lambda_2 \to 0^+$, that is,

$$\lim_{\lambda_2 \to 0^+} \hat\beta^{\mathrm{EN}}(\lambda_1, \lambda_2) = \hat\beta^{\mathrm{LARS}}(\lambda_1).$$

$^3$ This is actually what Zou & Hastie (2005) call the "naive" elastic net solution, and the modification $(1 + \lambda_2) \hat\beta^{\mathrm{EN}}$ is
what the authors refer to as the elastic net estimate. But in the limit as $\lambda_2 \to 0^+$, these two estimates are equivalent,
so our result in Lemma 10 holds for this modified estimate as well.
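Proof. By Lemma 13, we know that for any $y \notin \mathcal{N}$, where $\mathcal{N} \subseteq \mathbb{R}^n$ is a set of measure zero, the
LARS lasso solution at $\lambda_1$ satisfies $\hat\beta^{\mathrm{LARS}}(\lambda_1)_i \neq 0$ for all $i \in \mathcal{E}$. Hence fix $y \notin \mathcal{N}$. First note that we can
rewrite the LARS lasso solution as

$$\hat\beta_{-\mathcal{E}}^{\mathrm{LARS}}(\lambda_1) = 0 \quad \text{and} \quad \hat\beta_\mathcal{E}^{\mathrm{LARS}}(\lambda_1) = (X_\mathcal{E}^T X_\mathcal{E})^+ (X_\mathcal{E}^T y - \lambda_1 s).$$

Define the function

$$f(\lambda_2) = (X_\mathcal{E}^T X_\mathcal{E} + \lambda_2 I)^{-1} (X_\mathcal{E}^T y - \lambda_1 s) \quad \text{for } \lambda_2 > 0,$$
$$f(0) = (X_\mathcal{E}^T X_\mathcal{E})^+ (X_\mathcal{E}^T y - \lambda_1 s).$$

For fixed $\mathcal{E}, s$, the function $f$ is continuous on $[0, \infty)$ (continuity at 0 can be verified, for example,
by looking at the singular value decomposition of $(X_\mathcal{E}^T X_\mathcal{E} + \lambda_2 I)^{-1}$). Hence it suffices to show that
for small enough $\lambda_2 > 0$, the elastic net solution at $\lambda_1, \lambda_2$ is given by

$$\hat\beta_{-\mathcal{E}}^{\mathrm{EN}}(\lambda_1, \lambda_2) = 0 \quad \text{and} \quad \hat\beta_\mathcal{E}^{\mathrm{EN}}(\lambda_1, \lambda_2) = f(\lambda_2).$$

To this end, we show that the above proposed solution satisfies the KKT conditions for small
enough $\lambda_2$. The KKT conditions for the elastic net problem are

$$X^T (y - X \hat\beta^{\mathrm{EN}}) - \lambda_2 \hat\beta^{\mathrm{EN}} = \lambda_1 \gamma, \tag{25}$$
$$\gamma_i \in \begin{cases} \{\operatorname{sign}(\hat\beta_i^{\mathrm{EN}})\} & \text{if } \hat\beta_i^{\mathrm{EN}} \neq 0 \\ [-1, 1] & \text{if } \hat\beta_i^{\mathrm{EN}} = 0 \end{cases} \quad \text{for } i = 1, \ldots p. \tag{26}$$

Recall that $f(0) = \hat\beta_\mathcal{E}^{\mathrm{LARS}}(\lambda_1)$ are the equicorrelation coefficients of the LARS lasso solution at $\lambda_1$.
As $y \notin \mathcal{N}$, we have $f(0)_i \neq 0$ for each $i \in \mathcal{E}$, and further, $\operatorname{sign}(f(0)_i) = s_i$ for all $i \in \mathcal{E}$. Therefore
the continuity of $f$ implies that for small enough $\lambda_2$, $f(\lambda_2)_i \neq 0$ and $\operatorname{sign}(f(\lambda_2)_i) = s_i$ for all $i \in \mathcal{E}$.
Also, we know that $\|X_{-\mathcal{E}}^T (y - X_\mathcal{E} f(0))\|_\infty < \lambda_1$ by definition of the equicorrelation set $\mathcal{E}$, and again,
the continuity of $f$ implies that for small enough $\lambda_2$, $\|X_{-\mathcal{E}}^T (y - X_\mathcal{E} f(\lambda_2))\|_\infty < \lambda_1$. Finally, direct
calculation shows that

$$X_\mathcal{E}^T (y - X_\mathcal{E} f(\lambda_2)) - \lambda_2 f(\lambda_2) = X_\mathcal{E}^T y - (X_\mathcal{E}^T X_\mathcal{E} + \lambda_2 I)(X_\mathcal{E}^T X_\mathcal{E} + \lambda_2 I)^{-1} X_\mathcal{E}^T y + (X_\mathcal{E}^T X_\mathcal{E} + \lambda_2 I)(X_\mathcal{E}^T X_\mathcal{E} + \lambda_2 I)^{-1} \lambda_1 s = \lambda_1 s.$$

This verifies the KKT conditions for small enough $\lambda_2$, and completes the proof. □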
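The function $f$ from this proof is easy to evaluate numerically. The sketch below does so on a hypothetical toy instance (two identical unit-norm columns, with the equicorrelation signs entered by hand), and prints how $f(\lambda_2)$ approaches $f(0)$ along with the residual of the stationarity condition restricted to the equicorrelation set. The data and the use of `numpy.linalg.solve` and `numpy.linalg.pinv` are illustrative assumptions.

```python
import numpy as np

def f(lam2, XE, y, lam1, s):
    """Candidate elastic net coefficients on E; lam2 = 0 uses the pseudoinverse."""
    G = XE.T @ XE
    rhs = XE.T @ y - lam1 * s
    if lam2 == 0:
        return np.linalg.pinv(G) @ rhs
    return np.linalg.solve(G + lam2 * np.eye(G.shape[0]), rhs)

# Toy instance: duplicate unit-norm columns, both in E with positive signs.
x = np.array([0.6, 0.8])
XE = np.column_stack([x, x])
y = np.array([3.0, 4.0])
lam1 = 1.0
s = np.array([1.0, 1.0])

f0 = f(0.0, XE, y, lam1, s)               # LARS lasso coefficients on E
gaps, kkt_errors = [], []
for lam2 in [1.0, 1e-2, 1e-4, 1e-6]:
    b = f(lam2, XE, y, lam1, s)
    kkt = XE.T @ (y - XE @ b) - lam2 * b  # should equal lam1 * s
    gaps.append(np.linalg.norm(b - f0))
    kkt_errors.append(np.max(np.abs(kkt - lam1 * s)))
    print(lam2, b, kkt_errors[-1], gaps[-1])
```

In this example the gap to $f(0)$ shrinks roughly in proportion to $\lambda_2$, while the stationarity residual stays at rounding-error level for every $\lambda_2 > 0$.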

In Section 5, we discuss a few more properties of LARS lasso solutions, in the context of studying
the various support sets of lasso solutions. In the next section, we present a simple method for
computing lower and upper bounds on the coefficients of lasso solutions, useful when the solution is
not unique.



4 Lasso coefficient bounds 

Here we again consider a general predictor matrix X (not necessarily having columns in general 
position), so that the lasso solution is not necessarily unique. We show that it is possible to compute 
lower and upper bounds on the coefficients of lasso solutions, for any given problem instance, using 
linear programming. We begin by revisiting the KKT conditions. 
4.1 Back to the KKT conditions

The KKT conditions for the lasso problem were given in (2) and (3). Recall that the lasso fit $X\hat\beta$ is
always unique, by Lemma 1. Note that when $\lambda > 0$, we can rewrite (2) as

$$\gamma = \frac{1}{\lambda} X^T (y - X\hat\beta),$$

implying that the optimal subgradient $\gamma$ is itself unique. According to its definition (3), the
components of $\gamma$ give the signs of nonzero coefficients of any lasso solution, and therefore the uniqueness
of $\gamma$ immediately implies the following result.

Lemma 11. For any $y, X$, and $\lambda > 0$, any two lasso solutions $\hat\beta^{(1)}$ and $\hat\beta^{(2)}$ must satisfy $\hat\beta_i^{(1)} \cdot \hat\beta_i^{(2)} \ge 0$
for $i = 1, \ldots p$. In other words, any two lasso solutions must have the same signs over their common
support.

In a sense, this result is reassuring: it says that even when the lasso solution is not necessarily
unique, lasso coefficients must maintain consistent signs. Note that the same is certainly not true
of least squares solutions (corresponding to $\lambda = 0$), which causes problems for interpretation, as
mentioned in the introduction. Lemma 11 will be helpful when we derive lasso coefficient bounds
shortly.
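For completeness, the short argument behind Lemma 11 can be written out as follows. Both solutions share the same subgradient $\gamma$, and a nonzero coefficient forces the corresponding component of $\gamma$ to equal its sign:

```latex
\hat\beta^{(1)}_i \neq 0 \;\Rightarrow\; \gamma_i = \mathrm{sign}\bigl(\hat\beta^{(1)}_i\bigr),
\qquad
\hat\beta^{(2)}_i \neq 0 \;\Rightarrow\; \gamma_i = \mathrm{sign}\bigl(\hat\beta^{(2)}_i\bigr),
\\[4pt]
\text{so if both are nonzero, then }
\mathrm{sign}\bigl(\hat\beta^{(1)}_i\bigr) = \mathrm{sign}\bigl(\hat\beta^{(2)}_i\bigr) = \gamma_i
\;\Rightarrow\; \hat\beta^{(1)}_i \hat\beta^{(2)}_i > 0,
\\[4pt]
\text{and otherwise } \hat\beta^{(1)}_i \hat\beta^{(2)}_i = 0 .
```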

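We also saw in the introduction that different lasso solutions (at the same $y, X, \lambda$) can have
different supports, or active sets. The previously derived characterization of lasso solutions, given
in (8) and (9), provides an understanding of how this is possible. It helps to rewrite (8) and (9) as

$$\hat\beta_{-\mathcal{E}} = 0 \quad \text{and} \quad \hat\beta_\mathcal{E} = \hat\beta_\mathcal{E}^{\mathrm{LARS}} + b, \tag{27}$$

where $b$ is subject to

$$b \in \operatorname{null}(X_\mathcal{E}) \quad \text{and} \quad s_i \cdot (\hat\beta_i^{\mathrm{LARS}} + b_i) \ge 0, \; i \in \mathcal{E}, \tag{28}$$

and $\hat\beta^{\mathrm{LARS}}$ is the fundamental solution traced by the LARS algorithm, as given in (23). Hence
for a lasso solution $\hat\beta$ to have an active set $\mathcal{A} = \operatorname{supp}(\hat\beta)$, we can see that we must have $\mathcal{A} \subseteq \mathcal{E}$ and
$\hat\beta_\mathcal{E} = \hat\beta_\mathcal{E}^{\mathrm{LARS}} + b$, where $b$ satisfies (28) and also

$$b_i = -\hat\beta_i^{\mathrm{LARS}} \quad \text{for } i \in \mathcal{E} \setminus \mathcal{A},$$
$$b_i \neq -\hat\beta_i^{\mathrm{LARS}} \quad \text{for } i \in \mathcal{A}.$$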

As we discussed in the introduction, the fact that there may be different active sets corresponding
to different lasso solutions (at the same $y, X, \lambda$) is perhaps concerning, because different active sets
provide different "stories" regarding which predictor variables are important. One might ask: given
a specific variable of interest $i \in \mathcal{E}$ (recalling that all variables outside of $\mathcal{E}$ necessarily have zero
coefficients), is it possible for the ith coefficient to be nonzero at one lasso solution but zero at
another? The answer to this question depends on the interplay between the constraints in (28), and
as we show next, it is achieved by solving a simple linear program.

4.2 The polytope of solutions and lasso coefficient bounds 

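The key observation here is that the set of lasso solutions defined by (27) and (28) forms a convex
polytope. Consider writing the set of lasso solutions as

$$\hat\beta_{-\mathcal{E}} = 0 \quad \text{and} \quad \hat\beta_\mathcal{E} \in K = \{ x \in \mathbb{R}^{|\mathcal{E}|} : Px = \hat\beta_\mathcal{E}^{\mathrm{LARS}}, \; Sx \ge 0 \}, \tag{29}$$

where $P = P_{\operatorname{row}(X_\mathcal{E})}$ and $S = \operatorname{diag}(s)$. That (29) is equivalent to (27) and (28) follows from the fact
that $\hat\beta_\mathcal{E}^{\mathrm{LARS}} \in \operatorname{row}(X_\mathcal{E})$, hence $Px = \hat\beta_\mathcal{E}^{\mathrm{LARS}}$ if and only if $x = \hat\beta_\mathcal{E}^{\mathrm{LARS}} + b$ for some $b \in \operatorname{null}(X_\mathcal{E})$.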
The set $K \subseteq \mathbb{R}^{|\mathcal{E}|}$ is a polyhedron, since it is defined by linear equalities and inequalities, and
furthermore it is bounded, as all lasso solutions have the same $\ell_1$ norm by Lemma 1, making it a
polytope. The component-wise extrema of $K$ can be easily computed via linear programming. In
other words, for $i \in \mathcal{E}$, we can solve the following two linear programs:

$$\hat\beta_i^{\mathrm{lower}} = \min_{x \in \mathbb{R}^{|\mathcal{E}|}} \; x_i \quad \text{subject to } Px = \hat\beta_\mathcal{E}^{\mathrm{LARS}}, \; Sx \ge 0, \tag{30}$$
$$\hat\beta_i^{\mathrm{upper}} = \max_{x \in \mathbb{R}^{|\mathcal{E}|}} \; x_i \quad \text{subject to } Px = \hat\beta_\mathcal{E}^{\mathrm{LARS}}, \; Sx \ge 0, \tag{31}$$

and then we know that the ith component of any lasso solution satisfies $\hat\beta_i \in [\hat\beta_i^{\mathrm{lower}}, \hat\beta_i^{\mathrm{upper}}]$. These
bounds are tight, in the sense that each is achieved by the ith component of some lasso solution (in
fact, this solution is just the minimizer of (30), or the maximizer of (31)). By the convexity of $K$,
every value between $\hat\beta_i^{\mathrm{lower}}$ and $\hat\beta_i^{\mathrm{upper}}$ is also achieved by the ith component of some lasso solution.
Most importantly, the linear programs (30) and (31) can actually be solved in practice. Aside from
the obvious dependence on $y, X$, and $\lambda$, the relevant quantities $P$, $S$, and $\hat\beta_\mathcal{E}^{\mathrm{LARS}}$ only depend on
the equicorrelation set $\mathcal{E}$ and signs $s$, which in turn only depend on the unique lasso fit. Therefore,
one could compute any lasso solution (at $y, X, \lambda$) in order to define $\mathcal{E}, s$, and subsequently $P$, $S$ and
$\hat\beta_\mathcal{E}^{\mathrm{LARS}}$, all that is needed in order to solve (30) and (31). We summarize this idea below.

Algorithm 2 (Lasso coefficient bounds). 

Given $y, X$, and $\lambda > 0$.

1. Compute any solution $\hat\beta$ of the lasso problem (at $y, X, \lambda$), to obtain the unique lasso fit $X\hat\beta$.

2. Define the equicorrelation set $\mathcal{E}$ and signs $s$, as in (4) and (5), respectively.

3. Define $P = P_{\operatorname{row}(X_\mathcal{E})}$, $S = \operatorname{diag}(s)$, and $\hat\beta_\mathcal{E}^{\mathrm{LARS}}$ according to (23).

4. For each $i \in \mathcal{E}$, compute the coefficient bounds $\hat\beta_i^{\mathrm{lower}}$ and $\hat\beta_i^{\mathrm{upper}}$ by solving the linear programs
(30) and (31), respectively.
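The steps of Algorithm 2 can be sketched with NumPy and SciPy's `scipy.optimize.linprog`. Several choices below are assumptions made for illustration rather than part of the algorithm as stated: the function name `lasso_coefficient_bounds`, the numerical tolerance `tol`, the requirement that the caller supplies some lasso solution `beta_any` (Step 1), and the parametrization of the polytope as the LARS solution plus a null space component (instead of forming the projection $P$ explicitly), which avoids redundant equality constraints in the solver. The toy data with duplicate columns is likewise hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def lasso_coefficient_bounds(X, y, lam, beta_any, tol=1e-8):
    """Sketch of Algorithm 2; beta_any is any lasso solution at (y, X, lam)."""
    # Steps 1-2: equicorrelation set and signs from the unique lasso fit.
    corr = X.T @ (y - X @ beta_any)
    E = np.flatnonzero(np.abs(corr) >= lam - tol)
    s = np.sign(corr[E])
    XE = X[:, E]
    # Step 3: LARS lasso solution on E, as in (23).
    XE_pinv = np.linalg.pinv(XE)
    beta_lars = XE_pinv @ (y - XE_pinv.T @ (lam * s))
    # Orthonormal basis N of null(X_E): K = {beta_lars + N z : S (beta_lars + N z) >= 0}.
    _, sv, Vt = np.linalg.svd(XE)
    rank = int(np.sum(sv > tol * sv.max()))
    N = Vt[rank:].T
    lower, upper = beta_lars.copy(), beta_lars.copy()
    if N.shape[1] > 0:
        A_ub = -(s[:, None] * N)          # -S N z <= S beta_lars
        b_ub = s * beta_lars
        for i in range(len(E)):
            # Step 4: minimize and maximize the ith coordinate over K.
            lo = linprog(N[i], A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
            hi = linprog(-N[i], A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
            lower[i] = beta_lars[i] + lo.fun
            upper[i] = beta_lars[i] - hi.fun
    return E, lower, upper

# Toy instance: two identical unit-norm columns; (4, 0) is one lasso solution at lam = 1.
x = np.array([0.6, 0.8])
X = np.column_stack([x, x])
y = np.array([3.0, 4.0])
E, lower, upper = lasso_coefficient_bounds(X, y, 1.0, np.array([4.0, 0.0]))
print(E, lower, upper)
```

When the null space of $X_\mathcal{E}$ is trivial, the sketch returns the LARS coefficients as both bounds, reflecting the fact that the solution is then unique.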

Lemma 11 implies a valuable property of the bounding interval $[\hat\beta_i^{\mathrm{lower}}, \hat\beta_i^{\mathrm{upper}}]$, namely, that this
interval cannot contain zero in its interior. Otherwise, there would be a pair of lasso solutions with
opposite signs over the ith component, contradicting the lemma. Also, we know from Lemma 1 that
all lasso solutions have the same $\ell_1$ norm $L$, and this means that $|\hat\beta_i^{\mathrm{lower}}|, |\hat\beta_i^{\mathrm{upper}}| \le L$. Combining
these two properties gives the next lemma.

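Lemma 12. Fix any $y, X$, and $\lambda > 0$. Let $L$ be the common $\ell_1$ norm of lasso solutions at $y, X, \lambda$.
Then for any $i \in \mathcal{E}$, the coefficient bounds $\hat\beta_i^{\mathrm{lower}}$ and $\hat\beta_i^{\mathrm{upper}}$ defined in (30) and (31) satisfy

$$[\hat\beta_i^{\mathrm{lower}}, \hat\beta_i^{\mathrm{upper}}] \subseteq [0, L] \quad \text{if } s_i > 0,$$
$$[\hat\beta_i^{\mathrm{lower}}, \hat\beta_i^{\mathrm{upper}}] \subseteq [-L, 0] \quad \text{if } s_i < 0.$$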

Using Algorithm 2, we can identify all variables $i \in \mathcal{E}$ with one of two categories, based on their
bounding intervals:

(i) If $0 \in [\hat\beta_i^{\mathrm{lower}}, \hat\beta_i^{\mathrm{upper}}]$, then variable $i$ is called dispensable (to the lasso model at $y, X, \lambda$),
because there is a solution that does not include this variable in its active set. By Lemma 12,
this can only happen if $\hat\beta_i^{\mathrm{lower}} = 0$ or $\hat\beta_i^{\mathrm{upper}} = 0$.

(ii) If $0 \notin [\hat\beta_i^{\mathrm{lower}}, \hat\beta_i^{\mathrm{upper}}]$, then variable $i$ is called indispensable (to the lasso model at $y, X, \lambda$),
because every solution includes this variable in its active set. By Lemma 12, this can only
happen if $\hat\beta_i^{\mathrm{lower}} > 0$ or $\hat\beta_i^{\mathrm{upper}} < 0$.
It is helpful to return to the example discussed in the introduction. Recall that in this example
we took $n = 5$ and $p = 10$, and for a given $y, X$, and $\lambda = 1$, we found two lasso solutions: one
supported on variables $\{1, 2, 3, 4\}$, and another supported on variables $\{1, 2, 3\}$. In the introduction,
we purposely did not reveal the structure of the predictor matrix $X$; given what we showed in
Section 2 (that $X$ having columns in general position implies a unique lasso solution), it should
not be surprising to find out that here we have $X_4 = (X_2 + X_3)/2$. A complete description of our
construction of $X$ and $y$ is as follows: we first drew the components of the columns $X_1, X_2, X_3$
independently from a standard normal distribution, and then defined $X_4 = (X_2 + X_3)/2$. We also
drew the components of $X_5, \ldots X_{10}$ independently from a standard normal distribution, and then
orthogonalized $X_5, \ldots X_{10}$ with respect to the linear subspace spanned by $X_1, \ldots, X_4$. Finally, we
defined $y = -X_1 + X_2 + X_3$. The purpose of this construction was to make it easy to detect the
relevant variables $X_1, \ldots X_4$ for the linear model of $y$ on $X$.

According to the terminology defined above, variable 4 is dispensable to the lasso model when
$\lambda = 1$, because it has a nonzero coefficient at one solution but a zero coefficient at another. This is
perhaps not surprising, as $X_2, X_3, X_4$ are linearly dependent. How about the other variables? We
ran Algorithm 2 to answer this question. The results are displayed in Table 1.



  i    $\hat\beta_i^{\mathrm{lower}}$    $\hat\beta_i^{\mathrm{LARS}}$    $\hat\beta_i^{\mathrm{upper}}$
  1    -0.8928    -0.8928    -0.8928
  2     0.2455     0.6201     0.8687
  3     0          0.3746     0.6232
  4     0          0.4973     1.2465

Table 1: The results of Algorithm 2 for the small example from the introduction, with $n = 5$, $p = 10$. Shown
are the lasso coefficient bounds over the equicorrelation set $\mathcal{E} = \{1, 2, 3, 4\}$.

For the given $y, X$, and $\lambda = 1$, the equicorrelation set is $\mathcal{E} = \{1, 2, 3, 4\}$, and the sign vector is
$s = (-1, 1, 1, 1)^T$ (these are given by running Steps 1 and 2 of Algorithm 2). Therefore we know that
any lasso solution has zero coefficients for variables $5, \ldots 10$, has a nonpositive first coefficient, and
has nonnegative coefficients for variables 2, 3, 4. The third column of Table 1 shows the LARS lasso
solution over the equicorrelation variables. The second and fourth columns show the component-wise
coefficient bounds $\hat\beta_i^{\mathrm{lower}}$ and $\hat\beta_i^{\mathrm{upper}}$, respectively, for $i \in \mathcal{E}$. We see that variable 3 is dispensable,
because it has a lower bound of zero, meaning that there exists a lasso solution that excludes the
third variable from its active set (and this solution is actually computed by Algorithm 2, as it is
the minimizer of the linear program (30) with $i = 3$). The same conclusion holds for variable 4. On
the other hand, variables 1 and 2 are indispensable, because their bounding intervals do not contain
zero.
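The classification just described is a one-line test on each bounding interval. The snippet below applies it to the bounds reported in Table 1; the zero lower bounds for variables 3 and 4 are taken from the statement above that these variables have a lower bound of zero, and the helper name `classify` is hypothetical. With bounds computed numerically by a linear programming solver, one would likely compare against a small tolerance rather than exact zero.

```python
# Bounds (lower, upper) as reported in Table 1 for variables 1..4.
bounds = {
    1: (-0.8928, -0.8928),
    2: (0.2455, 0.8687),
    3: (0.0, 0.6232),
    4: (0.0, 1.2465),
}

def classify(lower, upper):
    """A variable is dispensable exactly when zero lies in [lower, upper]."""
    return "dispensable" if lower <= 0.0 <= upper else "indispensable"

labels = {i: classify(lo, hi) for i, (lo, hi) in bounds.items()}
print(labels)
```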

Like variables 3 and 4, variable 2 is linearly dependent on the other variables (in the equicorre- 
lation set), but unlike variables 3 and 4, it is indispensable and hence assigned a nonzero coefficient 
in every lasso solution. This is the first of a few interesting points about dispensability and indis- 
pensability, which we discuss below. 

• Linear dependence does not imply dispensability. In the example, variable 2 is indispensable,
as its coefficient has a lower bound of 0.2455 > 0, even though variable 2 is a linear function
of variables 3 and 4. Note that in order for the 2nd variable to be dispensable, we need to
be able to use the others (variables 1, 3, and 4) to achieve both the same fit and the same $\ell_1$
norm of the coefficient vector. The fact that variable 2 can be written as a linear function of
variables 3 and 4 implies that we can preserve the fit, but not necessarily the $\ell_1$ norm, with
zero weight on variable 2. Table 1 says that we can make the weight on variable 2 as small
as 0.2455 while keeping the fit and the $\ell_1$ norm unchanged, but that moving it below 0.2455
(and maintaining the same fit) inflates the $\ell_1$ norm.

• Linear independence implies indispensability (almost everywhere). In the next section we show
that, given any $X$ and $\lambda$, and almost every $y \in \mathbb{R}^n$, the quantity $\operatorname{col}(X_\mathcal{A})$ is invariant over
all active sets coming from lasso solutions at $y, X, \lambda$. Therefore, almost everywhere in $y$, if
variable $i \in \mathcal{E}$ is linearly independent of all other $j \in \mathcal{E}$ (meaning that $X_i$ cannot be expressed as
a linear function of $X_j$, $j \in \mathcal{E} \setminus \{i\}$), then variable $i$ must be indispensable; otherwise the span of
the active variables would be different for different active sets.

• Individual dispensability does not imply pairwise dispensability. Back to the above example,
variables 3 and 4 are both dispensable, but this does not necessarily mean that there exists
a lasso solution that excludes both 3 and 4 simultaneously from the active set. Note that the
computed solution that achieves a value of zero for its 3rd coefficient (the minimizer of (30) for
$i = 3$) has a nonzero 4th coefficient, and the computed solution that achieves zero for its 4th
coefficient (the minimizer of (30) for $i = 4$) has a nonzero 3rd coefficient. While this suggests
that variables 3 and 4 cannot simultaneously be zero for the current problem, it does not serve
as definitive proof of such a claim. However, we can check this claim by solving (30), with
$i = 4$, subject to the additional constraint that $x_3 = 0$. This does in fact yield a positive
lower bound, proving that variables 3 and 4 cannot both be zero at a solution. Furthermore,
moving beyond pairwise interactions, we can actually enumerate all possible active sets of
lasso solutions, by recognizing that there is a one-to-one correspondence between active sets
and faces of the polytope $K$; see Appendix A.3.

Next, we cover some properties of lasso solutions that relate to our work in this section and in 
the previous two sections, on uniqueness and non-uniqueness. 

5 Related properties 

We present more properties of lasso solutions, relating to issues of uniqueness and non-uniqueness.
The first three sections examine the active sets generated by lasso solutions of a given problem 
instance, when X is a general predictor matrix. The results in these three sections are reviewed 
from the literature. In the last section, we give a necessary condition for the uniqueness of the lasso 
solution. 

5.1 The largest active set 

For an arbitrary $X$, recall from Section 2 that the active set $\mathcal{A}$ of any lasso solution is necessarily
contained in the equicorrelation set $\mathcal{E}$. We show that the LARS lasso solution has support on all
of $\mathcal{E}$, making it the lasso solution with the largest support, for almost every $y \in \mathbb{R}^n$. This result
appeared in Tibshirani & Taylor (2012).

Lemma 13. Fix any $X$ and $\lambda > 0$. For almost every $y \in \mathbb{R}^n$, the LARS lasso solution $\hat\beta^{\mathrm{LARS}}$ has
an active set $\mathcal{A}$ equal to the equicorrelation set $\mathcal{E}$, and therefore achieves the largest active set of any
lasso solution.

Proof. For a matrix $A$, let $A_{[i]}$ denote its ith row. Define the set

$$\mathcal{N} = \bigcup_{\mathcal{E}, s} \bigcup_{i \in \mathcal{E}} \bigl\{ z \in \mathbb{R}^n : ((X_\mathcal{E})^+)_{[i]} \bigl( z - (X_\mathcal{E}^T)^+ \lambda s \bigr) = 0 \bigr\}. \tag{32}$$

The first union above is taken over all subsets $\mathcal{E} \subseteq \{1, \ldots p\}$ and sign vectors $s \in \{-1, 1\}^{|\mathcal{E}|}$, but
implicitly we exclude sets $\mathcal{E}$ such that $(X_\mathcal{E})^+$ has a row that is entirely zero. Then $\mathcal{N}$ has measure
zero, because it is a finite union of affine subspaces of dimension $n - 1$.

Now let $y \notin \mathcal{N}$. We know that no row of $(X_\mathcal{E})^+$ can be entirely zero (otherwise, this means that
$X_\mathcal{E}$ has a zero column, implying that $\lambda = 0$ by definition of the equicorrelation set, contradicting
the assumption in the lemma). Then by construction we have that $\hat\beta_i^{\mathrm{LARS}} \neq 0$ for all $i \in \mathcal{E}$. □

Remark 1. In the case that the lasso solution is unique, this result says that the active set is equal 
to the equicorrelation set, almost everywhere. 

Remark 2. Note that the equicorrelation set $\mathcal{E}$ (and hence the active set of a lasso solution, almost
everywhere) can have size $|\mathcal{E}| = p$ in the worst case, even when $p > n$. As a trivial example, consider
the case when $X \in \mathbb{R}^{n \times p}$ has $p$ duplicate columns, with $p > n$.

5.2 The smallest active set 

We have shown that the LARS lasso solution attains the largest possible active set, and so a natural 
question is: what is the smallest possible active set? The next result is from Osborne et al. (2000b)
and Rosset et al. (2004).

Lemma 14. For any $y, X$, and $\lambda > 0$, there exists a lasso solution whose set of active variables is
linearly independent. In particular, this means that there exists a solution whose active set $\mathcal{A}$ has
size $|\mathcal{A}| \le \min\{n, p\}$.

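Proof. We follow the proof of Rosset et al. (2004) closely. Let $\hat\beta$ be a lasso solution, let $\mathcal{A} = \operatorname{supp}(\hat\beta)$
be its active set, and suppose that $\operatorname{rank}(X_\mathcal{A}) < |\mathcal{A}|$. Then by the same arguments as those given in
Section 2, we can write, for some $i \in \mathcal{A}$,

$$s_i X_i = \sum_{j \in \mathcal{A} \setminus \{i\}} a_j s_j X_j, \quad \text{where} \quad \sum_{j \in \mathcal{A} \setminus \{i\}} a_j = 1. \tag{33}$$

Now define

$$\theta_i = -s_i \quad \text{and} \quad \theta_j = a_j s_j \; \text{ for } j \in \mathcal{A} \setminus \{i\}.$$

Starting at $\hat\beta$, we move in the direction of $\theta$ until a coefficient hits zero; that is, we define

$$\tilde\beta_{-\mathcal{A}} = 0 \quad \text{and} \quad \tilde\beta_\mathcal{A} = \hat\beta_\mathcal{A} + \delta \theta,$$

where

$$\delta = \min\{ \rho > 0 : \hat\beta_j + \rho \theta_j = 0 \text{ for some } j \in \mathcal{A} \}.$$

Notice that $\delta$ is guaranteed to be finite, as $\delta \le |\hat\beta_i|$. Furthermore, we have $X\tilde\beta = X\hat\beta$ because
$\theta \in \operatorname{null}(X_\mathcal{A})$, and also

$$\|\tilde\beta\|_1 = |\tilde\beta_i| + \sum_{j \in \mathcal{A} \setminus \{i\}} |\tilde\beta_j| = |\hat\beta_i| - \delta + \sum_{j \in \mathcal{A} \setminus \{i\}} \bigl( |\hat\beta_j| + \delta a_j \bigr) = \|\hat\beta\|_1.$$

Hence we have shown that $\tilde\beta$ achieves the same fit and the same $\ell_1$ norm as $\hat\beta$, so it is indeed also a
lasso solution, and it has one fewer nonzero coefficient than $\hat\beta$. We can now repeat this procedure
until we obtain a lasso solution whose active set $\mathcal{A}$ satisfies $\operatorname{rank}(X_\mathcal{A}) = |\mathcal{A}|$. □

Remark 1. This result shows that, for any problem instance, there exists a lasso solution supported
on $\le \min\{n, p\}$ variables; some works in the literature have misquoted this result by claiming that
every lasso solution is supported on $\le \min\{n, p\}$ variables, which is clearly incorrect. When the
lasso solution is unique, however, Lemma 14 implies that its active set has size $\le \min\{n, p\}$.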

Remark 2. In principle, one could start with any lasso solution, and follow the proof of Lemma
14 to construct a solution whose active set $\mathcal{A}$ is such that $\operatorname{rank}(X_\mathcal{A}) = |\mathcal{A}|$. But from a practical
perspective, this could be computationally quite difficult, as computing the constants $a_j$ in (33)
requires finding a nonzero vector in $\operatorname{null}(X_\mathcal{A})$, a nontrivial task that would need to be repeated
each time a variable is eliminated from the active set. To the best of our knowledge, the standard
optimization algorithms for the lasso problem (such as coordinate descent, first-order methods,
quadratic programming approaches) do not consistently produce lasso solutions with the property
that $\operatorname{rank}(X_\mathcal{A}) = |\mathcal{A}|$ over the active set $\mathcal{A}$. This is in contrast to the solution with largest active
set, which is computed by the LARS algorithm.
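For small problems, the elimination step from the proof of Lemma 14 can nonetheless be sketched directly. The version below is only an illustration: instead of the constants $a_j$, it takes a null vector of the active columns from a singular value decomposition (a substitute for the construction in the proof, which gives a direction of the same kind), and it assumes the input really is a lasso solution at some $\lambda > 0$, so that moving along the null direction preserves the $\ell_1$ norm. The function name, tolerance, and toy data are hypothetical.

```python
import numpy as np

def reduce_to_independent_support(X, beta, tol=1e-10):
    """Shrink the active set of a lasso solution until X_A has full column rank."""
    beta = np.array(beta, dtype=float)
    while True:
        A = np.flatnonzero(np.abs(beta) > tol)
        if len(A) == 0 or np.linalg.matrix_rank(X[:, A]) == len(A):
            return beta
        # A unit vector theta in null(X_A): the last right singular vector.
        theta = np.linalg.svd(X[:, A])[2][-1]
        moving = np.abs(theta) > tol
        steps = np.full(len(A), np.inf)
        steps[moving] = -beta[A][moving] / theta[moving]
        if not np.any((steps > 0) & np.isfinite(steps)):
            theta, steps = -theta, -steps     # walk the other way along the null direction
        steps[steps <= 0] = np.inf
        j = int(np.argmin(steps))             # first coefficient to hit zero
        beta[A] = beta[A] + steps[j] * theta
        beta[A[j]] = 0.0                      # remove rounding error at the zeroed entry

# Toy instance: duplicate unit-norm columns; (2, 2) is a lasso solution at lam = 1.
x = np.array([0.6, 0.8])
X = np.column_stack([x, x])
start = np.array([2.0, 2.0])
reduced = reduce_to_independent_support(X, start)
print(reduced)
```

Which of the two duplicate columns survives depends on the sign of the singular vector returned by the decomposition, so the sketch makes no promise about which minimal active set is reached.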

Remark 3. The proof of Lemma 14 does not actually depend on the lasso problem in particular, and
the arguments can be extended to cover the general $\ell_1$ penalized minimization problem (11), with
$f$ differentiable and strictly convex. (This is in the same spirit as our extension of lasso uniqueness
results to this general problem in Section 2.) Hence, to put it explicitly, for any differentiable,
strictly convex $f$, any $X$, and $\lambda > 0$, there exists a solution of (11) whose active set $\mathcal{A}$ is such that
$\operatorname{rank}(X_\mathcal{A}) = |\mathcal{A}|$.

The title "smallest" active set is justified, because in the next section we show that the subspace
$\operatorname{col}(X_\mathcal{A})$ is invariant under all choices of active sets $\mathcal{A}$, for almost every $y \in \mathbb{R}^n$. Therefore, for such
$y$, if $\mathcal{A}$ is an active set satisfying $\operatorname{rank}(X_\mathcal{A}) = |\mathcal{A}|$, then one cannot possibly find a solution whose
active set has size $< |\mathcal{A}|$, as this would necessarily change the span of the active variables.

5.3 Equivalence of active subspaces 

With the multiplicity of active sets (corresponding to lasso solutions of a given problem instance),
there may be difficulty in identifying and interpreting important variables, as discussed in the
introduction and in Section 4. Fortunately, it turns out that for almost every $y$, the span of the
active variables does not depend on the choice of lasso solution, as shown in Tibshirani & Taylor
(2012). Therefore, even though the linear models (given by lasso solutions) may report differences
in individual variables, they are more or less equivalent in terms of their scope, almost everywhere
in $y$.

Lemma 15. Fix any $X$ and $\lambda > 0$. For almost every $y \in \mathbb{R}^n$, the linear subspace $\operatorname{col}(X_\mathcal{A})$ is exactly
the same for any active set $\mathcal{A}$ coming from a lasso solution.

Due to the length and technical nature of the proof, we only give a sketch here, and refer the
reader to Tibshirani & Taylor (2012) for full details. First, we define a set $\mathcal{N} \subseteq \mathbb{R}^n$ (somewhat like
the set defined in (32) in the proof of Lemma 13) to be a union of affine subspaces of dimension
$\le n - 1$, and hence $\mathcal{N}$ has measure zero. Then, for any $y$ except in this exceptional set $\mathcal{N}$, we
consider any lasso solution at $y$ and examine its active set $\mathcal{A}$. Based on the careful construction of
$\mathcal{N}$, we can prove the existence of an open set $U$ containing $y$ such that any $y' \in U$ admits a lasso
solution that has an active set $\mathcal{A}$. In other words, this is a result on the local stability of lasso active
sets. Next, over $U$, the lasso fit can be expressed in terms of the projection map onto $\operatorname{col}(X_\mathcal{A})$. The
uniqueness of the lasso fit finally implies that $\operatorname{col}(X_\mathcal{A})$ is the same for any choice of active set $\mathcal{A}$
coming from a lasso solution at $y$.

5.4 A necessary condition for uniqueness (almost everywhere) 

We now give a necessary condition for uniqueness of the lasso solution, that holds for almost every
$y \in \mathbb{R}^n$ (considering $X$ and $\lambda$ fixed but arbitrary). This is in fact the same as the sufficient condition
given in Lemma 2, and hence, for almost every $y$, we have characterized uniqueness completely.
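Lemma 16. Fix any $X$ and $\lambda > 0$. For almost every $y \in \mathbb{R}^n$, if the lasso solution is unique, then
$\operatorname{null}(X_\mathcal{E}) = \{0\}$.

Proof. Let $\mathcal{N}$ be as defined in (32). Then for $y \notin \mathcal{N}$, the LARS lasso solution $\hat\beta^{\mathrm{LARS}}$ has active set
equal to $\mathcal{E}$. If the lasso solution is unique, then it must be the LARS lasso solution. Now suppose
that $\operatorname{null}(X_\mathcal{E}) \neq \{0\}$, and take any $b \in \operatorname{null}(X_\mathcal{E})$, $b \neq 0$. As the LARS lasso solution is supported on
all of $\mathcal{E}$, we know that

$$s_i \cdot \hat\beta_i^{\mathrm{LARS}} > 0 \quad \text{for all } i \in \mathcal{E}.$$

For $\delta > 0$, define

$$\hat\beta_{-\mathcal{E}} = 0 \quad \text{and} \quad \hat\beta_\mathcal{E} = \hat\beta_\mathcal{E}^{\mathrm{LARS}} + \delta b.$$

Then we know that

$$\delta b \in \operatorname{null}(X_\mathcal{E}) \quad \text{and} \quad s_i \cdot (\hat\beta_i^{\mathrm{LARS}} + \delta b_i) > 0, \; i \in \mathcal{E},$$

the above inequality holding for small enough $\delta > 0$, by continuity. Therefore $\hat\beta \neq \hat\beta^{\mathrm{LARS}}$ is also a
solution, contradicting uniqueness, which means that $\operatorname{null}(X_\mathcal{E}) = \{0\}$. □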
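Taken together with the sufficient condition, this suggests a simple numerical check: compute the equicorrelation set from any lasso solution and test whether the corresponding columns have full column rank. The sketch below does this with NumPy; the function name, the tolerance, and the two toy instances are hypothetical. A result of `True` certifies uniqueness for any $y$, whereas a result of `False` rules out uniqueness only for almost every $y$, so it should be read with that caveat for exceptional $y$.

```python
import numpy as np

def lasso_solution_is_unique(X, y, lam, beta_any, tol=1e-8):
    """Check null(X_E) = {0} via rank(X_E) == |E|, using any lasso solution."""
    corr = X.T @ (y - X @ beta_any)
    E = np.flatnonzero(np.abs(corr) >= lam - tol)   # equicorrelation set
    return bool(len(E) == 0 or np.linalg.matrix_rank(X[:, E]) == len(E))

# Duplicate columns: null(X_E) is nontrivial, so the solution is not unique.
x = np.array([0.6, 0.8])
X_dup = np.column_stack([x, x])
unique_dup = lasso_solution_is_unique(X_dup, np.array([3.0, 4.0]), 1.0, np.array([4.0, 0.0]))

# Orthonormal design: soft-thresholding gives the solution (2, 0) at lam = 1.
X_id = np.eye(2)
unique_id = lasso_solution_is_unique(X_id, np.array([3.0, 0.5]), 1.0, np.array([2.0, 0.0]))
print(unique_dup, unique_id)
```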

6 Discussion 

We studied the lasso problem, covering conditions for uniqueness, as well as results aimed at better 
understanding the behavior of lasso solutions in the non-unique case. Some of the results presented 
in this paper were already known in the literature, and others were novel. We give a summary here. 

Section 2 showed that any one of the following three conditions is sufficient for uniqueness of the
lasso solution: (i) $\operatorname{null}(X_\mathcal{E}) = \{0\}$, where $\mathcal{E}$ is the unique equicorrelation set; (ii) $X$ has columns in
general position; (iii) $X$ has entries drawn from a continuous probability distribution (the implication
now being uniqueness with probability one). These results can all be found in the literature, in one
form or another. They also apply to a more general $\ell_1$ penalized minimization problem, provided
that the loss function is differentiable and strictly convex when considered a function of $X\beta$ (this
covers, for example, $\ell_1$ penalized logistic regression and $\ell_1$ penalized Poisson regression). Section 5
showed that for the lasso problem, the condition $\operatorname{null}(X_\mathcal{E}) = \{0\}$ is also necessary for uniqueness of
the solution, almost everywhere in $y$. To the best of our knowledge, this is a new result.

Sections 3 and 4 contained novel work on extending the LARS path algorithm to the non-unique 
case, and on bounding the coefficients of lasso solutions in the non-unique case, respectively. The 
newly proposed LARS algorithm works for any predictor matrix $X$, whereas the original LARS 
algorithm only works when the lasso solution path is unique. Although our extension may superficially 
appear to be quite minor, its proof of correctness is somewhat more involved. In Section 3 we 
also discussed some interesting properties of LARS lasso solutions in the non-unique case. Section 
4 derived a simple method for computing marginal lower and upper bounds for the coefficients of 
lasso solutions of any given problem instance. It is also in this section that we showed that no two 
lasso solutions can exhibit different signs for a common active variable, implying that the bounding 
intervals cannot contain zero in their interiors. These intervals allowed us to categorize each 
equicorrelation variable as either "dispensable", meaning that some lasso solution excludes this variable 
from its active set, or "indispensable", meaning that every lasso solution includes this variable in its 
active set. We hope that this represents progress towards interpretation in the non-unique case. 

Finally, the remainder of Section 5 reviewed existing results from the literature on the active 
sets of lasso solutions in the non-unique case. The first was the fact that the LARS lasso solution is 
fully supported on $\mathcal{E}$, and hence attains the largest active set, almost everywhere in $y$. Next, there 
always exists a lasso solution whose active set $\mathcal{A}$ satisfies $\mathrm{rank}(X_{\mathcal{A}}) = |\mathcal{A}|$, and therefore has size 
$|\mathcal{A}| \leq \min\{n, p\}$. The last result gave an equivalence between all active sets of lasso solutions of a 
given problem instance: for almost every $y$, the subspace $\mathrm{col}(X_{\mathcal{A}})$ is the same for any active set $\mathcal{A}$ 
of a lasso solution. 

A Appendix 

A.1 Proof of correctness of the LARS algorithm 

We prove that for a general $X$, the LARS algorithm (Algorithm 1) computes a lasso solution path, 
by induction on $k$, the iteration counter. The key result is Lemma 17, which shows that the LARS 
lasso solution is continuous at each knot $\lambda_k$ in the path, as we change the equicorrelation set and 
signs from one iteration to the next. We delay the presentation and proof of Lemma 17 until we 
discuss the proof of correctness, for the sake of clarity. 

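The base case $k = 0$ is straightforward, hence assume that the computed path is a solution path 
through iteration $k - 1$, that is, for all $\lambda \geq \lambda_k$. Consider the $k$th iteration, and let $\mathcal{E}$ and $s$ denote 
the current equicorrelation set and signs. First we note that the LARS lasso solution, as defined in 
terms of the current $\mathcal{E}, s$, satisfies the KKT conditions at $\lambda_k$. This is implied by Lemma 17 and the 
fact that the KKT conditions were satisfied at $\lambda_k$ with the old equicorrelation set and signs. To be 
more explicit, Lemma 17 and the inductive hypothesis together imply that 

$$ \| X_{-\mathcal{E}}^T (y - X\hat\beta^{\mathrm{LARS}}(\lambda_k)) \|_\infty < \lambda_k, \qquad X_{\mathcal{E}}^T (y - X\hat\beta^{\mathrm{LARS}}(\lambda_k)) = \lambda_k s, $$

and $s = \mathrm{sign}(\hat\beta_{\mathcal{E}}^{\mathrm{LARS}}(\lambda_k))$, which verifies the KKT conditions at $\lambda_k$. Now note that for any $\lambda \leq \lambda_k$ 
(recalling the definition of $\hat\beta^{\mathrm{LARS}}(\lambda)$), we have 

$$
\begin{aligned}
X_{\mathcal{E}}^T (y - X\hat\beta^{\mathrm{LARS}}(\lambda)) &= X_{\mathcal{E}}^T y - X_{\mathcal{E}}^T X_{\mathcal{E}} (X_{\mathcal{E}})^+ y + X_{\mathcal{E}}^T (X_{\mathcal{E}}^T)^+ \lambda s \\
&= X_{\mathcal{E}}^T (X_{\mathcal{E}}^T)^+ \lambda s \\
&= \lambda s,
\end{aligned}
$$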

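where the last equality holds as $s \in \mathrm{row}(X_{\mathcal{E}})$. Therefore, as $\lambda$ decreases, only one of the following 
two conditions can break: $\| X_{-\mathcal{E}}^T (y - X\hat\beta^{\mathrm{LARS}}(\lambda)) \|_\infty < \lambda$, or $s = \mathrm{sign}(\hat\beta_{\mathcal{E}}^{\mathrm{LARS}}(\lambda))$. The first breaks 
at the next joining time $\lambda_{k+1}^{\mathrm{join}}$, and the second breaks at the next crossing time $\lambda_{k+1}^{\mathrm{cross}}$. Since we only 
decrease $\lambda$ to $\lambda_{k+1} = \max\{\lambda_{k+1}^{\mathrm{join}}, \lambda_{k+1}^{\mathrm{cross}}\}$, we have hence verified the KKT conditions for $\lambda \geq \lambda_{k+1}$, 
completing the proof. 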
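
The equicorrelation identity $X_{\mathcal{E}}^T (y - X\hat\beta^{\mathrm{LARS}}(\lambda)) = \lambda s$ can be sanity-checked numerically, including when $X_{\mathcal{E}}$ is rank deficient. The following is a minimal sketch, assuming $\hat\beta_{\mathcal{E}}^{\mathrm{LARS}}(\lambda) = (X_{\mathcal{E}})^+ (y - (X_{\mathcal{E}}^T)^+ \lambda s)$ and a hypothetical $X_{\mathcal{E}}$ with a duplicated column; the helper names are illustrative. It checks only this equality, not the inequality or sign conditions of the KKT system.

```python
import numpy as np

def pinv(M):
    # Explicit cutoff so that the numerically zero singular value is discarded.
    return np.linalg.pinv(M, rcond=1e-10)

def beta_lars_E(X_E, y, lam, s):
    """Assumed form of the equicorrelation coefficients: (X_E)^+ (y - (X_E^T)^+ lam s)."""
    return pinv(X_E) @ (y - pinv(X_E.T) @ (lam * s))

rng = np.random.default_rng(0)
n = 5
a = rng.standard_normal(n)
c = rng.standard_normal(n)
X_E = np.column_stack([a, a, c])   # rank 2, so null(X_E) is nontrivial
s = np.array([1.0, 1.0, -1.0])     # in row(X_E): the duplicated columns share a sign
y = rng.standard_normal(n)

for lam in (2.0, 1.0, 0.3):
    beta_E = beta_lars_E(X_E, y, lam, s)
    corr = X_E.T @ (y - X_E @ beta_E)
    print(lam, np.max(np.abs(corr - lam * s)))
```

The printed discrepancies should be at the level of floating point error for every $\lambda$, reflecting that the identity is purely algebraic once $s \in \mathrm{row}(X_{\mathcal{E}})$.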

Now we present Lemma 17, which shows that $\hat\beta^{\mathrm{LARS}}(\lambda)$ is continuous (considered as a function 
of $\lambda$) at every knot $\lambda_k$. This means that the constructed solution path is also globally continuous, 
as it is simply a linear function between knots. We note that Tibshirani & Taylor (2011) proved a 
parallel lemma (of the same name) for their dual path algorithm for the generalized lasso. 

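Lemma 17 (The insertion-deletion lemma). At the $k$th iteration of the LARS algorithm, let $\mathcal{E}$ 
and $s$ denote the equicorrelation set and signs, and let $\mathcal{E}^*$ and $s^*$ denote the same quantities at the 
beginning of the next iteration. The two possibilities are: 

1. (Insertion) If a variable joins the equicorrelation set at $\lambda_{k+1}$, that is, $\mathcal{E}^*$ and $s^*$ are formed by 
adding elements to $\mathcal{E}$ and $s$, then: 

$$
\begin{bmatrix} (X_{\mathcal{E}})^+ \big( y - (X_{\mathcal{E}}^T)^+ \lambda_{k+1} s \big) \\ 0 \end{bmatrix}
=
\begin{bmatrix}
\big[ (X_{\mathcal{E}^*})^+ \big( y - (X_{\mathcal{E}^*}^T)^+ \lambda_{k+1} s^* \big) \big]_{-i_{k+1}^{\mathrm{join}}} \\
\big[ (X_{\mathcal{E}^*})^+ \big( y - (X_{\mathcal{E}^*}^T)^+ \lambda_{k+1} s^* \big) \big]_{i_{k+1}^{\mathrm{join}}}
\end{bmatrix}.
\tag{34}
$$

2. (Deletion) If a variable leaves the equicorrelation set at $\lambda_{k+1}$, that is, $\mathcal{E}^*$ and $s^*$ are formed by 
deleting elements from $\mathcal{E}$ and $s$, then: 

$$
\begin{bmatrix}
\big[ (X_{\mathcal{E}})^+ \big( y - (X_{\mathcal{E}}^T)^+ \lambda_{k+1} s \big) \big]_{-i_{k+1}^{\mathrm{cross}}} \\
\big[ (X_{\mathcal{E}})^+ \big( y - (X_{\mathcal{E}}^T)^+ \lambda_{k+1} s \big) \big]_{i_{k+1}^{\mathrm{cross}}}
\end{bmatrix}
=
\begin{bmatrix} (X_{\mathcal{E}^*})^+ \big( y - (X_{\mathcal{E}^*}^T)^+ \lambda_{k+1} s^* \big) \\ 0 \end{bmatrix}.
\tag{35}
$$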



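Proof. We prove each case separately. The deletion case is actually easier, so we start with it. 

Case 2: Deletion. Let 

$$
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix}
\big[ (X_{\mathcal{E}})^+ \big( y - (X_{\mathcal{E}}^T)^+ \lambda_{k+1} s \big) \big]_{-i_{k+1}^{\mathrm{cross}}} \\
\big[ (X_{\mathcal{E}})^+ \big( y - (X_{\mathcal{E}}^T)^+ \lambda_{k+1} s \big) \big]_{i_{k+1}^{\mathrm{cross}}}
\end{bmatrix},
$$

the left-hand side of (35). By definition, we have $x_2 = 0$ because variable $i_{k+1}^{\mathrm{cross}}$ crosses through 
zero at $\lambda_{k+1}$. Now we consider $x_1$. Assume without loss of generality that $i_{k+1}^{\mathrm{cross}}$ is the last of the 
equicorrelation variables, so that we can write 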



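$$ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = (X_{\mathcal{E}})^+ \big( y - (X_{\mathcal{E}}^T)^+ \lambda_{k+1} s \big). $$

The point $(x_1, x_2)^T$ is the minimum $\ell_2$ norm solution of the linear equation: 

$$ X_{\mathcal{E}}^T X_{\mathcal{E}} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = X_{\mathcal{E}}^T y - \lambda_{k+1} s. $$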



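Decomposing this into blocks, 

$$
\begin{bmatrix}
X_{\mathcal{E}^*}^T X_{\mathcal{E}^*} & X_{\mathcal{E}^*}^T X_{i_{k+1}^{\mathrm{cross}}} \\
X_{i_{k+1}^{\mathrm{cross}}}^T X_{\mathcal{E}^*} & X_{i_{k+1}^{\mathrm{cross}}}^T X_{i_{k+1}^{\mathrm{cross}}}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} X_{\mathcal{E}^*}^T \\ X_{i_{k+1}^{\mathrm{cross}}}^T \end{bmatrix} y
- \lambda_{k+1} \begin{bmatrix} s^* \\ s_{k+1}^{\mathrm{cross}} \end{bmatrix}.
$$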

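Solving this for $x_1$ gives 

$$
\begin{aligned}
x_1 &= (X_{\mathcal{E}^*}^T X_{\mathcal{E}^*})^+ \big[ X_{\mathcal{E}^*}^T y - \lambda_{k+1} s^* - X_{\mathcal{E}^*}^T X_{i_{k+1}^{\mathrm{cross}}} x_2 \big] + b \\
&= (X_{\mathcal{E}^*})^+ \big( y - (X_{\mathcal{E}^*}^T)^+ \lambda_{k+1} s^* \big) + b,
\end{aligned}
$$

where $b \in \mathrm{null}(X_{\mathcal{E}^*})$, and the second line uses $x_2 = 0$. Recalling that $x_1$ must have minimal $\ell_2$ norm, we compute 

$$ \| x_1 \|_2^2 = \big\| (X_{\mathcal{E}^*})^+ \big( y - (X_{\mathcal{E}^*}^T)^+ \lambda_{k+1} s^* \big) \big\|_2^2 + \| b \|_2^2, $$

which is smallest when $b = 0$. This completes the proof. 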

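Case 1: Insertion. This proof is similar, but only a little more complicated. Now we let 

$$
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix}
\big[ (X_{\mathcal{E}^*})^+ \big( y - (X_{\mathcal{E}^*}^T)^+ \lambda_{k+1} s^* \big) \big]_{-i_{k+1}^{\mathrm{join}}} \\
\big[ (X_{\mathcal{E}^*})^+ \big( y - (X_{\mathcal{E}^*}^T)^+ \lambda_{k+1} s^* \big) \big]_{i_{k+1}^{\mathrm{join}}}
\end{bmatrix},
$$

the right-hand side of (34). Assuming without loss of generality that $i_{k+1}^{\mathrm{join}}$ is the largest of the 
equicorrelation variables, the point $(x_1, x_2)^T$ is the minimum $\ell_2$ norm solution to the linear equation: 

$$ X_{\mathcal{E}^*}^T X_{\mathcal{E}^*} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = X_{\mathcal{E}^*}^T y - \lambda_{k+1} s^*. $$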
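
If we now decompose this into blocks, we get 

$$
\begin{bmatrix}
X_{\mathcal{E}}^T X_{\mathcal{E}} & X_{\mathcal{E}}^T X_{i_{k+1}^{\mathrm{join}}} \\
X_{i_{k+1}^{\mathrm{join}}}^T X_{\mathcal{E}} & X_{i_{k+1}^{\mathrm{join}}}^T X_{i_{k+1}^{\mathrm{join}}}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} X_{\mathcal{E}}^T \\ X_{i_{k+1}^{\mathrm{join}}}^T \end{bmatrix} y
- \lambda_{k+1} \begin{bmatrix} s \\ s_{k+1}^{\mathrm{join}} \end{bmatrix}.
$$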



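Solving this system for $x_1$ in terms of $x_2$ gives 

$$
\begin{aligned}
x_1 &= (X_{\mathcal{E}}^T X_{\mathcal{E}})^+ \big[ X_{\mathcal{E}}^T y - \lambda_{k+1} s - X_{\mathcal{E}}^T X_{i_{k+1}^{\mathrm{join}}} x_2 \big] + b \\
&= (X_{\mathcal{E}})^+ \big[ y - (X_{\mathcal{E}}^T)^+ \lambda_{k+1} s - X_{i_{k+1}^{\mathrm{join}}} x_2 \big] + b,
\end{aligned}
$$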



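where $b \in \mathrm{null}(X_{\mathcal{E}})$, and as we argued in the deletion case, we know that $b = 0$ in order for $x_1$ to 
have minimal $\ell_2$ norm. Therefore we only need to show that $x_2 = 0$. To do this, we solve for $x_2$ in 
the above block system, plug in what we know about $x_1$, and after a bit of calculation we get 

$$
x_2 = \big[ X_{i_{k+1}^{\mathrm{join}}}^T (I - P) X_{i_{k+1}^{\mathrm{join}}} \big]^{-1}
\Big( X_{i_{k+1}^{\mathrm{join}}}^T \big[ (I - P) y + (X_{\mathcal{E}}^T)^+ \lambda_{k+1} s \big] - \lambda_{k+1} s_{k+1}^{\mathrm{join}} \Big),
$$

where we have abbreviated $P = P_{\mathrm{col}(X_{\mathcal{E}})}$. But the expression inside the parentheses above is exactly 

$$ X_{i_{k+1}^{\mathrm{join}}}^T \big( y - X\hat\beta^{\mathrm{LARS}}(\lambda_{k+1}) \big) - \lambda_{k+1} s_{k+1}^{\mathrm{join}} = 0, $$

by definition of the joining time. Hence we conclude that $x_2 = 0$, as desired, and this completes the 
proof. □ 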
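
The deletion identity (35) is algebraic, so it can be checked numerically without running the full path algorithm. The sketch below is a hedged illustration under simplifying assumptions: $X_{\mathcal{E}}$ is a random matrix with full column rank, $s$ is an arbitrary sign vector, and `lam_zero` is simply the value of $\lambda$ at which one chosen coefficient vanishes. In the actual algorithm the crossing time must also lie in $[0, \lambda_k]$ and $s$ must agree with the coefficient signs; neither is enforced here, and all names are illustrative.

```python
import numpy as np

def beta_lars_E(X_E, y, lam, s):
    """Assumed form of the equicorrelation coefficients: (X_E)^+ (y - (X_E^T)^+ lam s)."""
    return np.linalg.pinv(X_E) @ (y - np.linalg.pinv(X_E.T) @ (lam * s))

rng = np.random.default_rng(1)
n, m = 6, 3
X_E = rng.standard_normal((n, m))
y = rng.standard_normal(n)
s = np.array([1.0, -1.0, 1.0])

# Each coefficient is linear in lam: beta_E(lam) = c - lam * d.
c = np.linalg.pinv(X_E) @ y
d = np.linalg.pinv(X_E) @ np.linalg.pinv(X_E.T) @ s

# Delete the variable with the largest slope, at the lam where it hits zero.
i = int(np.argmax(np.abs(d)))
lam_zero = c[i] / d[i]
keep = np.delete(np.arange(m), i)

lhs = beta_lars_E(X_E, y, lam_zero, s)                    # full set E
rhs = beta_lars_E(X_E[:, keep], y, lam_zero, s[keep])     # reduced set E*

print(abs(lhs[i]))                       # deleted coordinate, numerically zero
print(np.max(np.abs(lhs[keep] - rhs)))   # remaining coordinates agree
```

Agreement of the remaining coordinates is what makes the path continuous at a knot where a variable is deleted.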



A.2 Local LARS algorithm for the lasso path 

We argue that there is nothing special about starting the LARS path algorithm at $\lambda = \infty$. Given 
any solution of the lasso problem at $y$, $X$, and $\lambda^* > 0$, we can define the unique equicorrelation set $\mathcal{E}$ 
and signs $s$, as in (4) and (5). The LARS lasso solution at $\lambda^*$ can then be explicitly constructed 
as in (23), and by following the same steps as those outlined in Section 3.1, we can compute the 
LARS lasso solution path beginning at $\lambda^*$, for decreasing values of the tuning parameter; that is, 
over $\lambda \in [0, \lambda^*]$. 

In fact, the LARS lasso path can also be computed in the reverse direction, for increasing values 
of the tuning parameter. Beginning with the LARS lasso solution at $\lambda^*$, it is not hard to see that in 
this direction (increasing $\lambda$) a variable enters the equicorrelation set at the next crossing time (the 
minimal crossing time larger than $\lambda^*$), and a variable leaves the equicorrelation set at the next joining 
time (the minimal joining time larger than $\lambda^*$). This is of course the opposite of the behavior of 
joining and crossing times in the usual direction (decreasing $\lambda$). Hence, in this manner, we can 
compute the LARS lasso path over $\lambda \in [\lambda^*, \infty]$. 

This could be useful in studying a large lasso problem: if we knew a tuning parameter value $\lambda^*$ 
of interest (even approximate interest), then we could compute a lasso solution at $\lambda^*$ using one of 
the many efficient techniques from convex optimization (such as coordinate descent, or accelerated 
first-order methods), and subsequently compute a local solution path around $\lambda^*$ to investigate the 
behavior of nearby lasso solutions. This can be achieved by finding the knots to the left and right of 
$\lambda^*$ (performing one LARS iteration in the usual direction and one iteration in the reverse direction), 
and repeating this, until a desired range $\lambda \in [\lambda^* - \delta_L, \lambda^* + \delta_R]$ is achieved. 
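
The first step of this local strategy, computing a solution at $\lambda^*$ and reading off the equicorrelation set and signs, might look as follows. This is only a sketch under stated assumptions: the criterion is taken to be $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, the solver is a plain cyclic coordinate descent with a fixed number of sweeps, and the equicorrelation set is identified with a numerical tolerance. The function names `lasso_cd` and `equicorrelation` are hypothetical, and the subsequent knot computations are not implemented here.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    """Cyclic coordinate descent for (1/2)||y - X b||_2^2 + lam * ||b||_1."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.copy()
    for _ in range(n_sweeps):
        for j in range(p):
            # Inner product of column j with the residual that excludes column j.
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def equicorrelation(X, y, beta, lam, tol=1e-6):
    """Indices with |X_i^T (y - X beta)| = lam (up to tol), and the matching signs."""
    corr = X.T @ (y - X @ beta)
    E = np.flatnonzero(np.abs(corr) >= lam - tol)
    return E, np.sign(corr[E])

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
y = rng.standard_normal(30)
lam_star = 2.0

beta = lasso_cd(X, y, lam_star)
E, s = equicorrelation(X, y, beta, lam_star)
print(E, s)
```

Given `E` and `s`, the LARS lasso solution at $\lambda^*$ and the neighboring knots would then be computed as described in the text.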



A.3 Enumerating all active sets of lasso solutions 

We show that the facial structure of the polytope $K$ in (29) describes the collection of active sets of 
lasso solutions, almost everywhere in $y$. 

Lemma 18. Fix any $X$ and $\lambda > 0$. For almost every $y \in \mathbb{R}^n$, there is a one-to-one correspondence 
between active sets of lasso solutions and nonempty faces of the polyhedron $K$ defined in (29). 



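Proof. Nonempty faces of $K$ are sets $F$ of the form $F = K \cap H \neq \emptyset$, where $H$ is a supporting 
hyperplane to $K$. If $\mathcal{A}$ is an active set of a lasso solution, then there exists an $x \in K$ such that 
$x_{\mathcal{E} \setminus \mathcal{A}} = 0$. Hence, recalling the sign condition in (28), the hyperplane $H_{\mathcal{E} \setminus \mathcal{A}} = \{x \in \mathbb{R}^{|\mathcal{E}|} : u^T x = 0\}$, 
where 

$$ u_i = \begin{cases} s_i & \text{if } i \in \mathcal{E} \setminus \mathcal{A}, \\ 0 & \text{otherwise}, \end{cases} $$

supports $K$. Furthermore, we have $F = K \cap H_{\mathcal{E} \setminus \mathcal{A}} = \{x \in K : \sum_{i \in \mathcal{E} \setminus \mathcal{A}} s_i x_i = 0\} = \{x \in K : x_{\mathcal{E} \setminus \mathcal{A}} = 0\}$. 
Therefore every active set $\mathcal{A}$ corresponds to a nonempty face $F$ of $K$. 

Now we show that the converse statement holds, for almost every $y$. The facets of $K$ are sets 
of the form $F_i = K \cap \{x \in \mathbb{R}^{|\mathcal{E}|} : x_i = 0\}$ for some $i \in \mathcal{E}$ (see footnote 4). Each nonempty proper face $F$ can be 
written as an intersection of facets: $F = \cap_{i \in I} F_i = \{x \in K : x_I = 0\}$, and hence $F$ corresponds to 
the active set $\mathcal{A} = \mathcal{E} \setminus I$. The face $F = K$ corresponds to the equicorrelation set $\mathcal{E}$, which itself is 
an active set for almost every $y \in \mathbb{R}^n$ by Lemma 12. □ 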

Note that this means that we can enumerate all possible active sets of lasso solutions, at a given 
$y, X, \lambda$, by enumerating the faces of the polytope $K$. This is a well-studied problem in computational 
geometry; see, for example, Fukuda et al. (1997) and the references therein. It is worth mentioning 
that this could be computationally intensive, as the number of faces can grow very large, even for a 
polytope of moderate dimensions. 
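
For intuition, the correspondence can be traced by hand on a tiny instance. The sketch below is not a general face-enumeration algorithm: it assumes that $K$ consists of the sign-feasible points of the form $\hat\beta_{\mathcal{E}}^{\mathrm{LARS}} + t b$, where $b$ spans a one-dimensional $\mathrm{null}(X_{\mathcal{E}})$, so that $K$ is a line segment with exactly three nonempty faces (the segment itself and its two endpoints). The design, the response and all names are hypothetical.

```python
import numpy as np

# Toy instance: orthonormal columns a and c, with a duplicated.
a = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 1.0, 0.0])
X_E = np.column_stack([a, a, c])

# Minimum l2 norm lasso solution for y = (3, -2, 0.5) and lam = 1 (assumed criterion
# (1/2)||y - X b||_2^2 + lam ||b||_1); all three variables are equicorrelated.
beta_lars = np.array([1.0, 1.0, -1.0])
s = np.sign(beta_lars)
b = np.array([1.0, -1.0, 0.0])   # spans null(X_E)

# K = {beta_lars + t*b : s_i * (beta_lars_i + t*b_i) >= 0 for all i} is a segment in t.
base = s * beta_lars
slope = s * b
t_lo = max(-base[i] / slope[i] for i in range(3) if slope[i] > 0)
t_hi = min(-base[i] / slope[i] for i in range(3) if slope[i] < 0)

def active_set(t, tol=1e-12):
    beta = beta_lars + t * b
    return tuple(int(i) for i in np.flatnonzero(np.abs(beta) > tol))

faces = {
    "K itself (relative interior)": active_set(0.5 * (t_lo + t_hi)),
    "vertex at t_lo": active_set(t_lo),
    "vertex at t_hi": active_set(t_hi),
}
for name, A in faces.items():
    print(name, A)
```

Each face yields a distinct active set, and the whole polytope corresponds to the equicorrelation set, in line with the lemma. For larger instances a dedicated face-enumeration method would be needed.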



References 

Candes, E. J. & Plan, Y. (2009), 'Near ideal model selection by $\ell_1$ minimization', Annals of Statistics 
37(5), 2145-2177. 

Chen, S., Donoho, D. L. & Saunders, M. (1998), 'Atomic decomposition for basis pursuit', SIAM 
Journal on Scientific Computing 20(1), 33-61. 

Donoho, D. L. (2006), 'For most large underdetermined systems of linear equations, the minimal 
$\ell_1$ solution is also the sparsest solution', Communications on Pure and Applied Mathematics 
59(6), 797-829. 

Dossal, C. (2012), 'A necessary and sufficient condition for exact sparse recovery by $\ell_1$ minimization', 
Comptes Rendus Mathematique 350(1-2), 117-120. 

Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), 'Least angle regression', Annals of 
Statistics 32(2), 407-499. 

Fuchs, J. J. (2005), 'Recovery of exact sparse representations in the presence of bounded noise', 
IEEE Transactions on Information Theory 51(10), 3601-3608. 

Fukuda, K., Liebling, T. M. & Margot, F. (1997), 'Analysis of backtrack algorithms for listing all 
vertices and all faces of a convex polyhedron', Computational Geometry: Theory and Applications 
8(1), 1-12. 

Mairal, J. & Yu, B. (2012), Complexity analysis of the lasso regularization path. arXiv:1205.0079. 

4 This is slightly abusing the notion of a facet, but the argument here can be made rigorous by reparametrizing the 
coordinates in terms of the affine subspace $\{x \in \mathbb{R}^{|\mathcal{E}|} : Px = \hat\beta_{\mathcal{E}}^{\mathrm{LARS}}\}$. 






Osborne, M., Presnell, B. & Turlach, B. (2000a), 'A new approach to variable selection in least 
squares problems', IMA Journal of Numerical Analysis 20(3), 389-404. 

Osborne, M., Presnell, B. & Turlach, B. (2000b), 'On the lasso and its dual', Journal of 
Computational and Graphical Statistics 9(2), 319-337. 

Rockafellar, R. T. (1970), Convex Analysis, Princeton University Press, Princeton. 

Rosset, S., Zhu, J. & Hastie, T. (2004), 'Boosting as a regularized path to a maximum margin 
classifier', Journal of Machine Learning Research 5, 941-973. 

Tibshirani, R. (1996), 'Regression shrinkage and selection via the lasso', Journal of the Royal 
Statistical Society: Series B 58(1), 267-288. 

Tibshirani, R. J. (2011), The Solution Path of the Generalized Lasso, PhD thesis, Department of 
Statistics, Stanford University. 

Tibshirani, R. J. & Taylor, J. (2011), Proofs and technical details for "The solution path of the 
generalized lasso". 

URL: http://www.stat.cmu.edu/~ryantibs/papers/genlasso-supp.pdf 

Tibshirani, R. J. & Taylor, J. (2012), 'Degrees of freedom in lasso problems', Annals of Statistics 
40(2), 1198-1232. 

Wainwright, M. J. (2009), 'Sharp thresholds for high-dimensional and noisy sparsity recovery 
using $\ell_1$-constrained quadratic programming (lasso)', IEEE Transactions on Information Theory 
55(5), 2183-2202. 

Zou, H. & Hastie, T. (2005), 'Regularization and variable selection via the elastic net', Journal of 
the Royal Statistical Society: Series B 67(2), 301-320. 






